ChatGPT Reveals Secrets in New PoC Attack

A team of researchers from Google DeepMind, OpenAI, ETH Zurich, McGill University, and the University of Washington has developed a new attack to extract key architectural information from proprietary large language models (LLMs) such as ChatGPT and Google PaLM-2.

The research shows how attackers can extract supposedly hidden data from an LLM-enabled chatbot in order to duplicate or outright steal its functionality. The attack – described in a technical report published this week – is one of many over the past year that have highlighted weaknesses AI tool makers still face in their technologies, even as adoption of their products surges.

As the researchers behind the new attack note, little is known publicly about how large language models like GPT-4, Gemini, and Claude 2 work. The developers of these technologies have deliberately chosen to withhold key details about their models’ training data, training methods, and decision-making logic, for competitive and safety reasons.

“However, although the weights and internal details of these models are not publicly accessible, the models themselves are exposed via API,” the researchers noted in their paper. Application programming interfaces allow developers to integrate AI-enabled tools like ChatGPT into their own applications, products, and services. The APIs let developers leverage AI models such as GPT-4, GPT-3, and PaLM-2 for different use cases, such as building virtual assistants and chatbots, automating business process workflows, generating content, and answering domain-specific queries.

Researchers at DeepMind, OpenAI, and the other institutions wanted to find out what information they could extract from AI models by querying them through their APIs. Unlike a previous attack in 2016, in which researchers showed how they could extract model data by working from the first, or input, layer, the researchers opted for what they described as a “top-down” attack. The goal was to see what they could extract by running targeted queries against the last layer of the neural network architecture, which is responsible for generating output predictions from the input data.
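As a rough sketch of what such query access looks like, the snippet below gathers a model’s next-token output scores for a batch of probe prompts over an HTTP API. The endpoint URL, the `return_logits` flag, and the `logits` response field are hypothetical placeholders rather than any real provider’s interface; the point is only that each query yields one output vector that an attacker can store for later analysis.

```python
import os

import numpy as np
import requests

# Hypothetical endpoint, flag, and response field: placeholders for
# illustration only, not any real provider's interface.
API_URL = "https://api.example.com/v1/completions"
API_KEY = os.environ.get("EXAMPLE_API_KEY", "")


def query_logits(prompt: str) -> np.ndarray:
    """Ask the model for its next-token output scores on a single prompt."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": 1, "return_logits": True},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumed response shape: {"logits": [one float per vocabulary token]}
    return np.array(resp.json()["logits"], dtype=np.float64)


# Each distinct prompt produces one output vector; stacking many of them
# gives the matrix that the analysis step described below works on.
prompts = [f"Probe prompt number {i}" for i in range(4096)]
logit_matrix = np.stack([query_logits(p) for p in prompts])
```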

An attack from above

Information in this layer can include important clues about how the model handles input data, transforms it, and runs it through a complex series of processes to generate a response. Attackers who are able to extract information from this so-called “embedding projection layer” can gain valuable insights into the inner workings of the model, which they can use to create more effective attacks, reverse engineer the model, or try to subvert its behavior.

Successful attacks at this level can reveal “the width of the transformer model, which is often related to the total parameter count,” the researchers said. “Secondly, it slightly reduces the degree to which the model is a complete ‘black box’, which could therefore be useful for future attacks.”
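The width-recovery idea can be illustrated with a small, self-contained NumPy simulation. Because the final layer of a transformer multiplies a hidden vector of size h by a vocabulary-by-h projection matrix, every logit vector the model returns lies in an h-dimensional subspace; stacking enough of them and counting the significant singular values therefore reveals h. The vocabulary size, hidden dimension, and threshold below are illustrative values chosen for this sketch, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a production model's final layer (illustrative sizes).
VOCAB_SIZE = 50_000   # number of tokens the model can score
HIDDEN_DIM = 256      # the secret "width" an attacker wants to recover

# Secret embedding projection matrix: maps hidden states to per-token logits.
W = rng.normal(size=(VOCAB_SIZE, HIDDEN_DIM))


def query_model(num_queries: int) -> np.ndarray:
    """Simulate API queries: each returns logits = W @ h for some hidden state h."""
    hidden_states = rng.normal(size=(num_queries, HIDDEN_DIM))
    return hidden_states @ W.T  # shape: (num_queries, VOCAB_SIZE)


# Collect more logit vectors than the suspected width, then inspect their rank.
logits = query_model(num_queries=512)
singular_values = np.linalg.svd(logits, compute_uv=False)

# The stacked logits have numerical rank equal to the hidden dimension, so the
# singular values drop to near zero after the first HIDDEN_DIM entries.
threshold = singular_values[0] * 1e-9
estimated_width = int(np.sum(singular_values > threshold))
print(f"Estimated hidden dimension: {estimated_width}")  # prints 256
```

Against a real API, the attacker would use the collected logit vectors in place of the simulated matrix; the rank argument, and hence the recovered width, is the same.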

The researchers found that by attacking the last layer of several large language models, they were able to extract substantial proprietary information about the models. “For less than $20, our attack extracts the entire projection matrix of OpenAI’s ada and babbage language models,” the researchers wrote. “We also retrieve the exact hidden dimension of the gpt-3.5-turbo model and estimate that it would cost less than $2,000 in queries to retrieve the entire projection matrix.”

The researchers acknowledged that their attack recovered only a relatively small portion of the targeted AI models. But “the fact that it is possible to steal any parameter of a production model is surprising and raises concerns that extensions of this attack may be able to recover more information.”

Over the past year there have been numerous other reports highlighting the weaknesses of popular GenAI models. Earlier this month, for example, researchers at HiddenLayer published a report describing how they got Google’s Gemini technology to misbehave in various ways by sending it carefully structured prompts. Others have found similar ways to jailbreak ChatGPT and make it generate content that it shouldn’t. And in December, researchers at Google DeepMind and elsewhere showed how they could extract ChatGPT’s hidden training data simply by prompting it to repeat certain words incessantly.


