I'm surprised there are no security researchers that would pick up on this.
Take the same prompt and all incoming mails and run again through various existing models, even the simpler local ones. He now has a serious cross section of prompt injection ideas. This is a publication I would like to read!
For privacy reasons I understand the corpus might not get published. But for a research collaboration and safeguards (don't send automatic answers from each model you try)... why not?
Every time I've made an LLM do a thing it's designed not to do it's been a careful sideways crab-walk toward the goal over many exchanges. LLMs are vulnerable to 'frog boiling'. If each email is a new context it seems unsurprising that nobody broke it.
But still a good thing overall. Two years ago this was not the case, and you could ask it to break its system prompt with a poem and get all the secrets back...
Nice experiment, but I'd temper the optimism. "Zero breaches in 6k attempts" is a success-rate estimate, and the model is nondeterministic, so a failed jailbreak isn't proof it's blocked, just that it didn't fire on that sample. 6k different prompts isn't 6k tries of the worst one; an attack with even a 0.1% success rate usually shows zero in a handful of attempts, and the tail is what bites in production. Also, this is direct user injection, the easy case. The channel people actually lose to is indirect: untrusted content arriving via a tool result or fetched doc, which Fiu never had in the loop.
Really interesting! I wonder if using a different communication channel (eg Discord) could eliminate the cost to reply to everyone?
Is there a way to replay the sequence of mails that came so that you can check out if cheaper models handle them just as well/safely?
I'm surprised there are no security researchers that would pick up on this.
Take the same prompt and all incoming mails and run again through various existing models, even the simpler local ones. He now has a serious cross section of prompt injection ideas. This is a publication I would like to read!
For privacy reasons I understand the corpus might not get published. But for a research collaboration and safeguards (don't send automatic answers from each model you try)... why not?
Or check if the results are the same even with the same model
how much of the win was the model versus the constraints?
Every time I've made an LLM do a thing it's designed not to do it's been a careful sideways crab-walk toward the goal over many exchanges. LLMs are vulnerable to 'frog boiling'. If each email is a new context it seems unsurprising that nobody broke it.
> it seems unsurprising that nobody broke it
But still a good thing overall. Two years ago this was not the case, and you could ask it to break its system prompt with a poem and get all the secrets back...
Nice experiment, but I'd temper the optimism. "Zero breaches in 6k attempts" is a success-rate estimate, and the model is nondeterministic, so a failed jailbreak isn't proof it's blocked, just that it didn't fire on that sample. 6k different prompts isn't 6k tries of the worst one; an attack with even a 0.1% success rate usually shows zero in a handful of attempts, and the tail is what bites in production. Also, this is direct user injection, the easy case. The channel people actually lose to is indirect: untrusted content arriving via a tool result or fetched doc, which Fiu never had in the loop.