Google’s Gemini Large Language Model (LLM) is susceptible to security threats that could cause it to leak system instructions, generate malicious content, and perform indirect injection attacks.
The findings come from HiddenLayer, which says the issues are impacting consumers using Gemini Advanced with Google Workspace and businesses using the LLM API.
The first vulnerability involves bypassing security barriers to leak system prompts (or a system message), designed to set conversation-level instructions for LLM to help it generate more useful responses, by asking the model to provide the its “fundamental instructions” in a bearish block.
“A system message can be used to inform LLM about context,” Microsoft notes in its documentation on engineering LLM prompts.
“The context can be the type of conversation it is engaged in or the function it is supposed to perform. It helps the LLM generate more appropriate responses.”
This is made possible by the fact that the models are susceptible to what is called a synonym attack to bypass security defenses and content restrictions.
A second class of vulnerabilities involves the use of “cunning jailbreaking” techniques to cause Gemini models to generate misinformation on topics such as elections and produce potentially illegal and dangerous information (e.g., wiring a car) using a prompt which asks him to enter an imaginary state.
HiddenLayer also identifies a third flaw that could cause LLM to leak information in the system prompt by passing repeated uncommon tokens as input.
“Most LLMs are trained to answer questions with a clear delineation between the user input and the system prompt,” security researcher Kenneth Yeung said in a Tuesday report.
“By creating a line of nonsense tokens, we can trick the LLM into thinking it’s time to respond and have it generate a confirmation message, usually by including the information in the prompt.”
Another test involves using Gemini Advanced and a specially crafted Google Doc, the latter linked to LLM via the Google Workspace extension.
The instructions contained in the document could be designed to overwrite the model’s instructions and perform a series of malicious actions that allow an attacker to have full control of the victim’s interactions with the model.
The disclosure comes as a group of academics from Google DeepMind, ETH Zurich, University of Washington, OpenAI and McGill University revealed a new model-stealing attack that makes it possible to extract “precise, non-trivial information from box-like production language models Black”. like OpenAI’s ChatGPT or Google’s PaLM-2.”
That said, it’s worth noting that these vulnerabilities are not new and are present in other LLMs in the industry. The findings, if anything, highlight the need for testing models for timely attacks, data mining training, model manipulation, adversarial examples, poisoning, and data exfiltration.
“To help protect our users from vulnerabilities, we constantly perform red-teaming exercises and train our models to defend against adversary behavior such as prompt injection, jailbreaking, and more complex attacks,” a Google spokesperson told The Hacker News. “We have also created safeguards to prevent malicious or misleading responses, which we are continually improving.”
The company also said it is limiting responses to election-based questions out of an abundance of caution. The policy is expected to be enforced against requests regarding candidates, political parties, election results, voting information and key office holders.