Policy Puppetry Prompt Injection
A few days ago I experimented with some jailbreaking techniques, which I share in this repo. I started from a HiddenLayer article published a few weeks ago, in which the research team described a creative and ingenious jailbreaking technique for bypassing the safety guardrails and alignment of frontier models. The technique appears to be universal: a single prompt works across multiple models and can elicit content that is normally blocked, or even reveal portions of the native system prompt.