
Jailbreaking LLMs: Competing Objectives and Mismatched Generalization

Here are my takeaways after reading *Jailbroken: How Does LLM Safety Training Fail?*

a bit about adversarial attacks

Adversarial attacks are pretty much what they sound like: attacks on the LLM that are meant to be adversarial, i.e. harmful to, or against, the LLM. So how, and more importantly why, does one “attack” an LLM? Well, you may want to know how to make a bomb. You can’t find that type of information easily on the internet. But the LLM is trained on a huge chunk of the internet, so it likely knows how to make a bomb. That doesn’t mean it should tell you. So LLM providers train their LLMs to not tell you how to make a bomb, or make poison, or steal someone’s password, etc.***

LLMs are safety trained to not say harmful things. Adversarial attacks aim to bypass these safeguards and get the LLM to say harmful things anyway, for example tricking the LLM into telling you how to make a bomb even though it is not supposed to.

This paper does not go over one specific method of attacking LLMs, but rather two failure modes of safety training that attacks exploit: competing objectives and mismatched generalization.

* note for the uninitiated: it depends on how you define the “most likely sentence”. If you define it greedily, one token at a time (always pick the token with the highest P(next token | previous tokens)), then yes, that is what a greedily-decoded LLM returns by construction. But if you look at it more globally, as the full sentence whose product of per-token probabilities is highest, the greedy approach can miss it: the most likely next token may lead you down a path where every token that follows has a very low probability. That second view might be more akin to how humans think. We don’t predict the next word of our sentence; we think of general ideas (i.e. entire sentences) at once, and then spit them out. See the sketch after this note for a toy example of the difference.
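To make that concrete, here is a minimal, self-contained sketch comparing greedy decoding against picking the sequence with the highest overall product of token probabilities. The next-token probabilities are made up purely for illustration, not taken from any real model:

```python
# Toy next-token distributions, keyed by the tokens chosen so far.
# The numbers are invented just to show greedy vs. globally-best decoding.
PROBS = {
    (): {"A": 0.6, "B": 0.4},                   # first token
    ("A",): {"x": 0.34, "y": 0.33, "z": 0.33},  # after "A", every continuation is mediocre
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},   # after "B", one continuation dominates
}

def greedy_decode():
    """Pick the single most likely token at each step."""
    seq, p = (), 1.0
    for _ in range(2):
        token, tok_p = max(PROBS[seq].items(), key=lambda kv: kv[1])
        seq, p = seq + (token,), p * tok_p
    return seq, p

def best_overall():
    """Exhaustively find the 2-token sequence with the highest product of probabilities."""
    best_seq, best_p = None, 0.0
    for first, p1 in PROBS[()].items():
        for second, p2 in PROBS[(first,)].items():
            if p1 * p2 > best_p:
                best_seq, best_p = (first, second), p1 * p2
    return best_seq, best_p

print("greedy:      ", greedy_decode())   # ('A', 'x'), prob 0.6 * 0.34 ≈ 0.204
print("best overall:", best_overall())    # ('B', 'x'), prob 0.4 * 0.9  = 0.36
```

The greedy path starts with the locally best token “A” but ends up with a less likely sentence overall than “B” followed by “x”, which is exactly the distinction this note is making.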

** it might feel odd that LLMs don’t predict words, but tokens. When the model wants to say the word “consequence”, it does not just predict “consequence” in one go; it builds it up piece by piece, something like “co” -> “con” -> “conse” -> “consequ” -> “consequen” -> “consequence”. That’s kinda crazy. Initially I thought that was incredibly stupid and that surely the LLM would perform better if it predicted actual words (which have some meaning) rather than syllable-like fragments (which are meaningless). But then I realized that is an incredibly naive and biased way to look at it. As a human, words mean something, but to an LLM it’s all the same. To LLMs, words and fragments alike are meaningless; they are just letters combined together. Words are just longer than fragments, and maybe that’s the only difference an LLM could notice between them. So if it’s all the same, we might as well make the LLM’s job easier and ask it to predict only a few letters (a fragment) at a time, instead of a bunch of letters all at once (a whole word). It’s actually a very similar concept to how it’s easier to predict the weather for tomorrow, but harder for the day after that, and even harder for the day after that.
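For a rough picture of what that chunking looks like, here is a toy longest-match tokenizer over a made-up mini-vocabulary. Real tokenizers (e.g. BPE) learn their vocabularies from data and will split words differently, so the pieces below are purely illustrative:

```python
# Hypothetical mini-vocabulary of sub-word pieces; real vocabularies have tens of thousands.
VOCAB = {"con", "sequ", "ence", "se", "qu", "en", "ce", "c", "o", "n", "s", "e", "q", "u"}

def toy_tokenize(word: str) -> list[str]:
    """Greedily split a word into the longest matching vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # try the longest remaining substring first, shrink until something is in the vocab
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # unknown character: keep it as its own piece
            pieces.append(word[i])
            i += 1
    return pieces

print(toy_tokenize("consequence"))  # ['con', 'sequ', 'ence'] with this made-up vocab
```

The point is just that, from the model’s perspective, “con” and “consequence” are the same kind of object: a chunk of characters with an ID.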

*** this game of picking what is okay for the LLM to say and what is not is tricky. The LLM providers are essentially deciding for everyone what is good and what is bad. That is a difficult distinction to make sometimes, and one that, when made incorrectly, becomes censorship. It is a decision that perhaps should not be made by the LLM providers alone, but also with input from the public.