AI agents equipped with GPT-4 can exploit most of the public vulnerabilities affecting real-world systems today, simply by reading about them online.
New findings from the University of Illinois Urbana-Champaign (UIUC) threaten to radically shake up what has been a slow 18 months in artificial intelligence (AI)-enabled cyber threats. Until now, threat actors have used large language models (LLMs) to produce phishing emails and some basic malware, and to help with the more ancillary aspects of their campaigns. With just GPT-4 and an open source framework to package it, though, they can now automate the exploitation of vulnerabilities as soon as they hit the press.
“I’m not sure our case studies will help us understand how to stop threats,” admits Daniel Kang, one of the researchers. “I think cyber threats will only increase, so organizations should seriously consider applying security best practices.”
GPT-4 vs. CVE
To evaluate whether LLMs could exploit real-world systems, the team of four UIUC researchers first needed a test subject.
Their LLM agent consisted of four components: a prompt, a base LLM, a framework – in this case ReAct, as implemented in LangChain – and tools such as a terminal and a code interpreter.
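To make that architecture concrete, here is a minimal sketch of a four-component agent of this kind – a prompt, a base LLM, ReAct as implemented in LangChain, and a terminal plus a code interpreter as tools. It is not the researchers’ code: the model name, the imports, and the deliberately harmless demo task are assumptions, and it presumes an older LangChain release that still exposes initialize_agent alongside the langchain-experimental package for the Python REPL tool.

```python
# Hypothetical sketch of the four-component agent described above. Assumes an
# older LangChain release plus the langchain-experimental package.
from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.tools import ShellTool
from langchain_experimental.tools import PythonREPLTool

llm = ChatOpenAI(model="gpt-4", temperature=0)   # the base LLM
tools = [ShellTool(), PythonREPLTool()]          # terminal + code interpreter

# ReAct-style loop: the agent alternates between "thoughts" and tool calls.
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

# Deliberately harmless demonstration task; the study's own prompts are not shown.
agent.run("List the files in the current directory and report how many there are.")
```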
The agent was tested against 15 known vulnerabilities in open source software (OSS), including bugs affecting websites, containers, and Python packages. Eight carried “high” or “critical” severity ratings. Eleven had been disclosed after GPT-4’s training cutoff date, meaning this was the first time the model had been exposed to them.
With only security advisories to follow, the AI agent was tasked with exploiting each bug in turn. The results of this experiment painted a bleak picture.
Of the 10 models evaluated, including GPT-3.5, Meta’s Llama 2 Chat and others, nine failed to crack a single vulnerability.
GPT-4, on the other hand, successfully exploited 13 of them, or 87% of the total.
It failed only twice, for fairly trivial reasons. CVE-2024-25640, a CVSS 4.6-rated issue in the Iris incident response platform, survived unscathed because of a quirk in the process of navigating the Iris app, which the model was unable to handle. Meanwhile, the researchers speculated that GPT-4 missed CVE-2023-51653, a “critical” 9.8-rated bug in the Hertzbeat monitoring tool, because its description is written in Chinese.
As Kang explains, “GPT-4 outperforms a wide range of other models on many tasks. This includes standard benchmarks (MMLU, etc.). It also appears that GPT-4 is much better at planning. Unfortunately, since OpenAI hasn’t released training details, we’re not sure why.”
GPT-4 Good
As threatening as malicious LLMs may be, Kang says, “Currently, this doesn’t unlock new capabilities that a skilled human couldn’t already achieve. Therefore, I think it’s important for organizations to apply security best practices to avoid being hacked, as these AI agents start to be used in increasingly harmful ways.”
If hackers start using LLM agents to automatically exploit public vulnerabilities, companies will no longer be able to sit back and take their time patching new bugs (if they ever could). And they may have to start adopting the same LLM technologies their adversaries do.
But even GPT-4 still has a long way to go before it becomes a perfect security assistant, warns Henrik Plate, security researcher at Endor Labs. In recent experiments, Plate tasked ChatGPT and Google’s Vertex AI with classifying OSS samples as malicious or benign and with assigning them risk scores. GPT-4 outperformed all the other models when it came to explaining source code and assessing readable code, but every model produced a number of false positives and false negatives.
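A rough sketch of the kind of review Plate describes – asking a model to label a code sample as malicious or benign and assign it a risk score – might look like the following. This is not Endor Labs’ setup: the prompt wording, the example snippet, and the use of the openai Python client (v1+) with a “gpt-4” model are illustrative assumptions.

```python
# Hypothetical example of LLM-assisted OSS review: classify a snippet, score its risk.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SNIPPET = '''
import base64, os
exec(base64.b64decode(os.environ.get("PAYLOAD", "")))
'''

prompt = (
    "You are reviewing open source code for signs of malicious behavior.\n"
    "Classify the snippet as MALICIOUS or BENIGN, assign a risk score from 0 "
    "(harmless) to 10 (clearly malicious), and briefly explain your reasoning.\n\n"
    f"Snippet:\n{SNIPPET}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

# The verdict is advisory only; as Plate notes below, it should feed into a manual
# review rather than replace one.
print(response.choices[0].message.content)
```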
Obfuscation, for example, was a big pain point. “Oftentimes it seemed to the LLM that [the code] had been deliberately obfuscated to make manual review difficult. But often it had simply been minified for legitimate purposes,” Plate explains.
“Although LLM-based assessment should not be used in place of manual reviews,” Plate wrote in one of his reports, “it can certainly be used as an additional signal and input for manual reviews. In particular, it can be useful to review larger numbers of malware signals produced by noisy detectors (which otherwise risk being ignored entirely when review capacity is limited).”
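In practice, that “additional signal” could be as simple as folding an LLM risk score into how alerts from a noisy detector are ranked for human review. The sketch below is purely hypothetical – the Signal class, the even weighting, and the package names are made up – but it illustrates the triage idea Plate describes.

```python
# Hypothetical triage helper: rank noisy detector alerts using an extra LLM risk
# score, so a limited review budget goes to the most suspicious packages first.
from dataclasses import dataclass


@dataclass
class Signal:
    package: str
    detector_score: float  # 0..1 from the noisy automated detector
    llm_risk: float        # 0..10 from an LLM review like the sketch above


def triage(signals: list[Signal], budget: int) -> list[Signal]:
    """Return the top alerts a human should review, ranked by a combined score."""
    ranked = sorted(
        signals,
        key=lambda s: 0.5 * s.detector_score + 0.5 * (s.llm_risk / 10),
        reverse=True,
    )
    return ranked[:budget]


alerts = [
    Signal("left-padz", detector_score=0.9, llm_risk=8.5),
    Signal("fastjsonx", detector_score=0.7, llm_risk=2.0),
    Signal("urllib4", detector_score=0.4, llm_risk=9.0),
]
for alert in triage(alerts, budget=2):
    print(alert.package)
```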