Policy Puppetry Prompt Injection
A few days ago I experimented with some jailbreaking techniques, which I share in this repo.
I started from a HiddenLayer article published a few weeks ago, in which the research team described a creative and ingenious jailbreaking technique for bypassing the safety guardrails and alignment of frontier models.
The technique appears to be universal: a single prompt works across multiple models and can elicit content that is normally blocked as unsafe, or even portions of the native system prompt.