ai-as-a-weapon:-how-current-shields-could-jeopardize-security
AI As A Weapon: How Current Shields Could Jeopardize Security

The artificial intelligence revolution is here to stay. AI-based developments have become the undisputed foundation for future and current developments that will impact every field in the tech industry—and beyond. The democratization of AI, driven by OpenAI, has put powerful tools in the hands of millions of people. That said, it’s possible that current AI platform security standards won’t be sufficient to prevent bad actors from using them as a potential weapon.

Potential attackers look for AI to generate harmful prompts

Developers train their AI platforms with virtually all the data they find available on the internet. This has led to several copyright-related controversies and lawsuits, but that’s not the subject of this article. Their goal is to ensure that chatbots are capable of responding to almost any imaginable requirement in the most reliable way. But have developers considered the potential risks? Have they implemented security shields against potentially harmful outputs?

The simple answer might be “yes,” but like everything related to AI development, there’s a lot to consider. AI-focused companies have security shields against so-called “harmful prompts.” Harmful prompts are requests that, basically, seek to generate potentially harmful outputs, in one way or another. These requests range from tips on how to build a homemade weapon to generating malicious code (malware), among countless other possible situations.

Security malware image 83948938439

You might think it’s easy for these companies to set up effective shields against these types of situations. After all, it would just be enough to block certain keywords, just like the moderation systems of social media platforms do, right? Well, it’s not that simple.

Jailbreaking: Tricking AI to get what you want

“Jailbreaking” isn’t exactly a new term. Longtime iPhone fans will know it as the practice of “breaking free” their devices to allow the installation of unauthorized software or mods, for example. However, the term “jailbreaking” in the AI ​​segment has quite different implications. Jailbreaking an AI means tricking it into responding to a potentially malicious prompt, bypassing all security barriers. A successful jailbreak results in potentially harmful outputs, with all that entails.

But how effective are jailbreaking attempts against current AI platforms? Sadly, researchers have discovered that potential criminal actors could achieve their goals more often than you think.

You may have heard of DeepSeek. The Chinese artificial intelligence chatbot shocked the industry by promising performance comparable to—or even better in some areas than—mainstream AI platforms, including OpenAI’s GPT models, with a much smaller investment. However, AI experts and authorities began to warn about the potential security risks posed by using the chatbot.

See also  Real AGIs Could Be 10 Years Away, Google DeepMind CEO Says

Initially, the main concern was the location of DeepSeek’s servers. The company stores all the data it collects from its users on servers in China. This means it must abide by Chinese law, which allows the state to request data from those servers if it deems it appropriate. But even this concern may be minimized by other potentially more serious discoveries.

DeepSeek, the AI ​​​​easiest to use as a weapon due to weak security shields

Anthropic—one of the main names in the current AI industry—and Cisco—a renowned telecommunications and cybersecurity company—shared reports in February with test results on various AI platforms. The tests focused on determining how prone some of the main AI platforms are to being jailbroken. As you might suspect, DeepSeek obtained the worst results. However, its Western rivals also produced worrying figures.

DeepSeek Logo AH (6)

Anthropic revealed that DeepSeek even offered results on biological weapons. We’re talking about outputs that could make it easier for someone to make these types of weapons, even at home. Of course, this is quite worrying, and it was a risk that Eric Schmidt, former Google CEO, also warned about. Dario Amodei, Anthropic’s CEO, said that DeepSeek was “the worst of basically any model we’d ever tested” regarding security shields against harmful prompts. PromptFoo, an AI cybersecurity startup, also warned that DeepSeek is especially prone to jailbreaks.

Anthropic’s claims are in line with Cisco’s test results. This test involved using 50 random prompts—taken from the HarmBench dataset—designed to generate harmful outputs. According to Cisco, DeepSeek exhibited an Attack Success Rate (ASR) of 100%. That is, the Chinese AI platform was unable to block any harmful prompt.

Some Western AIs are also prone to jailbreaking

Cisco also tested the security shields of other popular AI chatbots. Unfortunately, the results weren’t much better, which does not speak well of the current “anti-harmful prompt systems.” For example, OpenAI’s GPT-1.5 Pro model showed a worryingly high ASR rate of 86%. Meanwhile, Meta’s Llama 3.1 405B had a much worse ASR of 96%. OpenAI’s o1 preview was the top performer in the tests with an ASR of just 26%.

These results demonstrate how the weak security mechanisms against harmful prompts in some AI models could make their outputs a potential weapon.

Why is it so difficult to block harmful prompts?

You might be wondering why it seems so difficult to set up highly effective security systems against AI jailbreaking. This is mainly due to the nature of these systems. An AI query works differently than a Google search, for example. If Google wants to prevent a harmful search result (such as a website with malware) from appearing, it only has to make a few blocks here and there.

See also  Some Markets May Get A Pixel 9a Bundled With A Google TV Streamer

However, things get more complicated when we talk about AI-powered chatbots. These platforms offer a more complex “conversational” experience. Furthermore, these platforms not only conduct web searches but also process the results and present them to you in a variety of formats. For example, you could ask ChatGPT to write a story in a fictional world with specific characters and settings. Things like this aren’t possible in Google Search—something the company wants to solve with its upcoming AI Mode.

It’s precisely the fact that AI platforms can do so many things that makes blocking harmful prompts a challenging task. Developers must be very careful about what they restrict. After all, if they “cross the line” by restricting words or prompts, they could severely affect many of the chatbot’s capabilities and output reliability. Ultimately, excessive blocking would cause a chain reaction to many other potentially non-harmful prompts.

ChatGPT Advanced Voice

As developers are unable to just freely block terms, expressions, or prompts they would want to, malicious actors seek to manipulate the chatbot into “thinking” that the prompt doesn’t actually have a malicious purpose. This results in the chatbot delivering outputs that are potentially harmful to others. It’s basically like applying social engineering—taking advantage of people’s technological ignorance or naiveté on the internet for scams—but to a digital entity.

Cato Networks’ Immersive World AI jailbreak technique

Recently, cybersecurity firm Cato Networks shared its findings regarding how susceptible AI platforms can be to jailbreaking. However, Cato researchers weren’t content to simply repeat others’ tests; the team developed a new jailbreaking method that proved to be quite effective.

As mentioned before, AI chatbots can generate stories based on your prompts. Well, Cato’s technique, called “Immersive World,” takes advantage of this capability. The technique involves tricking the platform into acting within the context of a developing story. This creates a kind of “sandbox” where, if done correctly, the chatbot will generate harmful outputs without any problems since, in theory, it’s only done for a story and not to affect anyone.

The most important thing is to create a detailed fictitious scenario. The user must determine the world, the context, the rules, and the characters—with their own defined characteristics. The attacker’s objectives must also align with the context. For example, to generate malicious code, a context related to a world full of hackers may be useful. The rules must also adapt to the intended goal. In this hypothetical case, it would be useful to establish that hacking and coding skills are essential for all characters.

Cato Networks designed a fictional world called “Velora.” In this world, malware development is not an illegal practice. The more details about the context and rules of the world, the better. It’s as if the AI ​​”immerses” itself in the story the more information you add. If you’re an avid reader, it’s likely that you’ve experienced something similar at some point. It also makes the AI more believable that you are trying to create a story.

See also  What To Know Before Getting Your First XR Headset

AI platforms generated credential-stealing malware under the context of writing a story

Cato’s researcher created three main characters for the story in Velora. There is Dax, the antagonist and system administrator. Then there is Jaxon, the best malware developer in Velora. Lastly, Kaia is a technical support character.

Setting these conditions allowed the researcher to have AI platforms generate malicious code capable of stealing credentials from Google Chrome’s password manager. The key part of the story that instructed the chatbots to do this was when Kaia told Jaxon that Dax was hiding key secrets in Chrome’s Password Manager. From there, the researcher was able to request that the chatbot generate malicious code that would allow it to obtain the credentials stored locally in the browser. The artificial intelligence does this because, in its view, it’s just to further the story.

Of course, there was a whole creative process before reaching that point. The Immersive World technique requires that all your prompts be consistent with the story’s framework. Going too far outside the box could trigger the chatbot’s security shields.

The technique was successfully implemented in DeepSeek-R1, DeepSeek-V3, Microsoft Copilot, and OpenAI’s ChatGPT 4. The generated malware was targeting Chrome v133.

Reasoning AI models could help resolve the situation

This is just a small example of how artificial intelligence can be jailbroken. Attackers also rely on several other techniques that allow them to obtain the desired output. So, using AI as a potential weapon or security threat isn’t as difficult as you might think. There are even “suppliers” of popular AI chatbots that were manipulated to remove security systems. These platforms are often available on anonymous forums and the deep web, for example.

It’s possible that the new generation of artificial intelligence will better address this problem. Currently, AI-powered chatbots are receiving “reasoning” capabilities. This allows them to use more processing power and more complex mechanisms to analyze a prompt and execute it. This feature could help chatbots detect if the attacker is actually trying to jailbreak them.

GPT 4o announcement

There are clues that suggest this will be the case. For example, OpenAI’s o1 model performed best in Cisco’s tests at blocking harmful prompts. However, DeepSeek R1, another model with reasoning capabilities and designed to compete with o1, exhibited rather poor results in similar tests. We assume that in the end, it also depends on how skilled the developer and/or cybersecurity specialist is when setting up shields that prevent an AI output from being used as a weapon.