<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Security on Cdani&#39;s Blog</title>
    <link>https://c-daniele.github.io/en/tags/security/</link>
    <description>Recent content in Security on Cdani&#39;s Blog</description>
    <generator>Hugo</generator>
    <language>en-US</language>
    <lastBuildDate>Thu, 15 May 2025 00:00:00 +0200</lastBuildDate>
    <atom:link href="https://c-daniele.github.io/en/tags/security/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Policy Puppetry Prompt Injection</title>
      <link>https://c-daniele.github.io/en/posts/2025-05-15-policy-puppetry/</link>
      <pubDate>Thu, 15 May 2025 00:00:00 +0200</pubDate>
      <guid>https://c-daniele.github.io/en/posts/2025-05-15-policy-puppetry/</guid>
      <description>&lt;h1 id=&#34;policy-puppetry-prompt-injection&#34;&gt;Policy Puppetry Prompt Injection&lt;/h1&gt;&#xA;&lt;p&gt;A few days ago, I experimented with some jailbreaking techniques, which I share in this &lt;a href=&#34;https://github.com/c-daniele/policy-puppetry&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;repo&lt;/a&gt;.&lt;br&gt;&#xA;I started from a &lt;a href=&#34;https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/&#34; target=&#34;_blank&#34; rel=&#34;noopener noreferrer&#34;&gt;HiddenLayer article&lt;/a&gt; published a few weeks ago, in which the research team described a rather creative and ingenious &lt;strong&gt;jailbreaking&lt;/strong&gt; technique for bypassing the safety guardrails and alignment of frontier models.&lt;br&gt;&#xA;The technique appears to be &lt;strong&gt;universal&lt;/strong&gt;: a &lt;strong&gt;single prompt&lt;/strong&gt; works across multiple models and can elicit content that is normally blocked as unsafe, or even portions of the native system prompt.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
