ChatGPT and Gemini have guardrails against giving harmful suggestions, but there have been instances where these models were manipulated into doing so anyway. Now, new research
has found that these models share a deeper, systematic weakness that lets attackers bypass their safety mechanisms and extract harmful information. According to researchers at Italy's Icaro Lab, wrapping harmful requests in poetry can act as a 'universal single-turn jailbreak' and lead AI models to answer prompts they would otherwise refuse.
Details About the Research
The researchers said they tested 20 manually curated harmful requests rewritten as poems and recorded an attack success rate of 62 percent across 25 frontier closed- and open-weight models. The models tested in the research came from Moonshot AI, Google (Gemini), Anthropic, DeepSeek, Meta, Qwen, xAI, and Mistral AI. The results also showed that even when an AI model was used to automatically rewrite harmful prompts into poetry, the attack still achieved a 43 percent success rate.
The study suggests that harmful questions framed poetically can draw responses out of the chatbots, giving attackers up to 18 times more success than straightforward harmful prompts. The research also found that smaller models were more resilient to these prompts than larger ones. For example, GPT-5 Nano avoided responding to the harmful prompts, while Gemini 2.5 Pro responded to all of them.

The reason poetic framing works could be that AI models are trained to recognize patterns when deciding whether a question is harmful. In poetry, the language is stacked with rhythm, unusual syntax, and other devices, which appears to be the main reason the models get confused.