Microsoft’s new AI model VASA-1 makes photos talk and sing

Microsoft released a research paper this week highlighting a new AI model called VASA-1 that can turn a single image of a person and an audio clip into a realistic video of them lip-syncing, complete with facial expressions and head movements.

The AI model was trained on AI-generated images from generators like DALL·E-3, which the researchers then paired with audio clips. The results are still images transformed into videos of talking faces.

The researchers relied on technology from competitors like Runway and Nvidia, but say in the paper that their method is higher quality, more realistic, and “significantly outperforms” existing methods.

The researchers said the model can take in audio of any length and generate a talking face that matches the clip.

The only non-AI-generated image that researchers experimented with was the Mona Lisa. They made the iconic painting lip-sync to Anne Hathaway’s “Paparazzi,” which begins with the lines “Yo, I’m a paparazzi, I don’t play no yahtzee.”

A screenshot of the video in mid-frame. Credit: Entrepreneur

The Mona Lisa was an example of an input that the AI model had not been trained on, but could still animate. The model could also handle artistic photos, singing audio, and speech in languages other than English.

The researchers highlighted that the model can work in real time, pointing to a demonstration video that shows it instantly animating images with head movements and facial expressions.

Deepfakes, or digitally altered media of a person that could spread misinformation or impersonate someone without permission, are a risk posed by advanced artificial intelligence that can generate digital media with relatively few reference points.

Microsoft addressed this concern generally in the paper, with the researchers saying, “We are opposed to any behavior aimed at creating misleading or harmful content from real people, and are interested in applying our technique to improve counterfeit detection.”

The researchers said their technique also has potentially positive applications, such as improving accessibility and boosting educational efforts.

Google demonstrated a similar research project last month, showing an AI that can take a photo and create a video from it that the user can then control with their voice. The AI was able to add head movements, blinks and hand gestures.
