The paper title is a bit misleading. The tested detectors and models here are small and rather dated (Llama 3.1 8B and Gemini Flash 2.0 - these are basically at the level of a modern 1B model), and the actual paper says this only shows vulnerability in small model systems.
It concerns me that anyone with anything important to protect might trust what this paper calls "Injection detectors deployed to protect LLM agents" - Llama Guard and the like.
There are unlimited combinations of tokens that can be used to attack an LLM system. The idea that some kind of "detector" can catch them all just feels inherently absurd to me.
Contemporary tech culture has successfully trained influential people to be beyond credulous.
If you have somebody promising a feature and somebody saying that the feature is impossible or a time bomb for catastrophe, the default for most executives and many developers these days is to believe the person promising the feature. And then, to boot, you can trust that same executive or developer to shirk responsibility when things fail later with a "How could I have known?! [Now defunct company] said it would work!"
This is an "uh oh" moment, isn't it?