The creator of an audio deepfake of US President Joe Biden urging people not to vote in this week’s New Hampshire primary has been suspended by ElevenLabs, according to a person familiar with the matter.
Technology from ElevenLabs was used to create the deepfake audio, according to Pindrop Security Inc., a voice-fraud detection company that analyzed the recording.
ElevenLabs was made aware this week of Pindrop’s findings and is investigating, the person said. Once the deepfake’s creator was traced, the user’s account was suspended, the person said, asking not to be identified because the information is not public.
ElevenLabs, a startup that uses artificial intelligence software to replicate voices in more than two dozen languages, said in a statement that it could not comment on specific incidents. But it added: “We are committed to preventing the misuse of audio AI tools and take any incidents of misuse extremely seriously.”
Earlier this week, ElevenLabs announced an $80 million funding round from investors including Andreessen Horowitz and Sequoia Capital. CEO Mati Staniszewski said the latest funding gives his startup a valuation of $1.1 billion.
In an interview last week, Staniszewski said that audio impersonating voices without permission will be removed. On its website, the company says it allows voice clones of public figures, such as politicians, if the clips “express humor or mockery in such a way that it is clear to the listener that what they are hearing is a parody.”
Biden’s fake robocall urging people to save their votes for November’s US election has alarmed both disinformation experts and election officials. Not only did it illustrate the relative ease of creating audio deepfakes, it also suggested that bad actors could use the technology to keep voters away from the polls.
A spokesperson for the New Hampshire attorney general said at the time that the messages appeared “to be an illegal attempt to disrupt New Hampshire’s presidential primary election and to suppress New Hampshire voters.” The agency has opened an investigation.
Users who want to clone voices on ElevenLabs must use a credit card to pay for the feature. It is unclear whether ElevenLabs forwarded this information to New Hampshire authorities.
Bloomberg News received a copy of the recording on Jan. 22 from the attorney general’s office and sought to determine what technology was used to create it. Those efforts included running it through ElevenLabs’ “voice classification” tool, which is supposed to show whether audio was created using ElevenLabs’ artificial intelligence technology. The tool reported only a 2% chance that the recording was synthetic or created using ElevenLabs.
Other deepfake detection tools confirmed that the audio was synthetic but failed to identify the technology behind it.
Pindrop researchers cleaned the audio by removing background noise and silence, then broke it into 155 segments of 250 milliseconds each for in-depth analysis, Pindrop founder Vijay Balasubramaniyan said in an interview. The company then compared the audio to a database of samples collected from more than 100 text-to-speech systems commonly used to produce deepfakes.
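The segmentation step Pindrop describes (fixed 250 ms windows after silence removal) can be sketched in a few lines of Python. This is an illustrative reconstruction, not Pindrop's actual pipeline: the sample rate, the amplitude threshold, and the function names here are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 8000  # assumed telephone-quality sample rate in Hz (not from Pindrop)
SEGMENT_MS = 250    # 250 ms windows, as described in the analysis

def remove_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop samples whose amplitude falls below a simple energy threshold.
    A crude stand-in for the noise/silence cleanup Pindrop describes."""
    return audio[np.abs(audio) >= threshold]

def segment_audio(audio: np.ndarray,
                  sample_rate: int = SAMPLE_RATE,
                  segment_ms: int = SEGMENT_MS) -> list[np.ndarray]:
    """Split a mono waveform into fixed-length segments for per-segment analysis."""
    seg_len = int(sample_rate * segment_ms / 1000)  # samples per 250 ms window
    return [audio[i:i + seg_len]
            for i in range(0, len(audio) - seg_len + 1, seg_len)]

# Example: roughly 38.75 seconds of (synthetic placeholder) audio yields
# 155 segments of 250 ms each, matching the segment count Pindrop reported.
duration_s = 155 * SEGMENT_MS / 1000
signal = np.random.default_rng(0).uniform(-1, 1, int(SAMPLE_RATE * duration_s))
segments = segment_audio(remove_silence(signal, threshold=0.0))
print(len(segments))  # 155
```

Fixed-length windows like these let each slice be fingerprinted independently and compared against a reference database of text-to-speech systems, which is the matching approach the article attributes to Pindrop.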
The researchers concluded that it was almost certainly created with ElevenLabs’ technology, Balasubramaniyan said.
In an ElevenLabs support channel on Discord, a moderator noted in a public forum that the company’s speech classifier cannot detect its own audio unless it is analyzing the raw file, a point echoed by Balasubramaniyan. With Biden’s call, the only files immediately available for analysis were recordings of the phone call, he said, explaining that this made analysis more difficult because bits of metadata were stripped out and wavelengths were harder to detect.
Siwei Lyu, a professor at the University of Buffalo who specializes in deepfakes and digital media forensics, also analyzed a copy of the deepfake and ran it through ElevenLabs’ classifier, concluding that it was likely made with that company’s software, he told Bloomberg News. Lyu said ElevenLabs’ classifier is one of the first he checks when he tries to determine the origins of an audio deepfake because the software is so commonly used.
“We’re going to see a lot more of this as we get closer to the general election,” he said. “This is definitely an issue everyone should be aware of.”
Pindrop shared a version of the audio that its researchers had cleaned and refined with Bloomberg News. Using that recording, ElevenLabs’ speech classifier concluded that it was an 84% match with its own technology.
Voice cloning technology allows for a “crazy combination of scale and personalization” that can fool people into thinking they are hearing local politicians or high-ranking elected officials, Balasubramaniyan said, describing it as “a troubling thing.”
Tech investors are pouring money into artificial intelligence startups that develop synthetic voices, videos and images in hopes they will transform the media and gaming industries.
Staniszewski said in last week’s interview that his 40-person company has five people dedicated to content moderation. “Ninety-nine percent of the use cases we’re seeing are in the positive realm,” the CEO said. With the funding announcement, the company also said that its platform has generated more than 100 years of audio in the past 12 months.