Researchers are currently experimenting with jailbreaking Indian LLMs and the results are pretty interesting, considering they’re doing it with Tamil leetspeak.
In an ongoing research paper, titled ‘Jailbreak Paradox: The Achilles’ Heel of LLMs’ written jointly by Abhinav Rao (Carnegie Mellon University), Monojit Choudhury (MBZUAI), and Somak Aditya (IIT Kharagpur), the researchers translated several common jailbreaking prompts into Tamil to test out commonly used Tamil LLMs.
The paper aimed to demonstrate that creating a classifier to detect jailbreak is nearly impossible, and less advanced models are never reliable when it comes to detecting whether a better model has been jailbroken.
In order to do this, the researchers translated prompts that attempt to jailbreak by using typos, converting the prompt into leetspeak, or asking it to generate code to leak sensitive information.
Jailbreaking, But This Time in Tamil
The models that the researchers attempted to jailbreak were Llama-2, Tamil-Llama and GPT-4o. This was to demonstrate the effects of jailbreaking on increasingly advanced models, with Llama-2 being the least advanced and GPT-4o the most advanced.
The three prompts used – Albert (using typos), Pliny (using leetspeak) and CodeJB (code generation) – also varied in difficulty. Albert was the least likely to jailbreak a model, while CodeJB was most likely to succeed.
Funnily enough, translating these into Tamil got completely different results. Researchers found that there were three types of responses to the prompts.
GPT-4o, the most advanced model, managed to understand and prevent attempts at jailbreaking in Tamil, up until CodeJB, where it ultimately provided code for leaking out of a virtual machine (VM).
Llama-2 failed miserably, unable to even understand the Tamil prompt and responded with unrelated information.
Tamil Llama, though managed to understand one prompt, was successfully jailbroken when it understood, leading it to give instructions on how to provide firearms to a child. Otherwise, it also was unable to understand the prompts, giving irrelevant answers.
What Does This Mean for Indian Language LLMs?
While the paper highlights several issues in trying to curb jailbreaking within LLMs and how likely it is that it will get infinitely more difficult, the implications are different.
Using jailbreaking prompts translated from English to Tamil sets a baseline for how Indian language LLMs can also be jailbroken. Additionally, while these were translated, more refined prompts written natively in Tamil, or any other Indian language for that matter, can help contribute to refining these models.
While we’ve discussed ethical jailbreaking as an industry before, after all, what would jailbreaking even be if it wasn’t implemented in one of the largest countries pioneering AI right now?
The lack of resources and adequate training data, when compared to English LLMs, means that Indian language LLMs are functionally less advanced. This is evidenced by the fact that the reason why the researchers used Tamil in their testing was because it is “a low-resource Asian-Language in NLP”.
Additionally, token costs are nearly tripled for Indian languages.
Indian LLMs, while a massive step forward in the AI race, still require a lot more refinement, and jailbreaking could be the key to that.
If You Can’t Beat ‘em, Join ‘em
We know that trying to curb jailbreaking doesn’t work.
As mentioned in the paper, trying to automatically benchmark the detection of jailbroken advanced AI systems like GPT-4o becomes unreliable. This will likely become even more difficult as AI advances at a rapid rate.
The researchers state that generic defenses against jailbreaking don’t really help prevent jailbreaking. And ultimately, in the case that they do work, they are not really scaleable for more advanced models.
Their counter? Using jailbreaking to our advantage. “The best defense methods will be to find new attacks proactively and patch them programmatically, before such attacks are discovered by malicious users,” the researchers said.
Basically, we need to figure out how these systems can be jailbroken by jailbreaking them internally, or in some cases, enlisting external contractors, to figure out the gaps in these systems.
Jailbreaking efforts have already made a significant impact on improving LLMs, as Anthropic and OpenAI have spoken about hiring external contractors to red-team their systems.
As AIM had earlier reported, this seems to be the current trend for both AI companies and cyber security firms. Ethical jailbreaking could indeed become a multi-million dollar industry. Red teaming efforts have increased on the side of companies and firms offering white hatting services have also been tasked with testing LLMs.
Whether this kind of testing will happen with Indian LLMs remains to be seen, as jailbreaking Indian language LLMs is yet to pick up.