EMO AI Alibaba: An AI tool that converts photos into talking and singing videos.

Alibaba's Institute for Intelligent Computing has introduced EMO, an AI system that generates lifelike videos of individuals speaking or singing from audio waveforms. EMO surpasses existing methods in expressiveness and realism and synchronizes lip movements accurately, raising the question of whether Alibaba will build the tool into its products.

By Mahammad Rafi

Researchers at Alibaba’s Institute for Intelligent Computing have unveiled an artificial intelligence system called ‘EMO’, short for Emote Portrait Alive. As the name suggests, the tool animates individual portrait photos, creating lifelike videos of people speaking or singing.

Unlike traditional methods that rely on 3D face models or blend shapes, EMO takes a direct audio-to-video synthesis approach.

By converting audio waveforms directly into video frames, the system captures the subtle facial motions and identity-specific nuances associated with natural speech.
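To make the “direct audio-to-video” idea concrete, here is a minimal sketch of a model that maps a sequence of audio feature slices straight to video frames, with no intermediate 3D face model or blend shapes. The architecture, module names, and dimensions below are illustrative assumptions for this article, not EMO’s actual implementation (which, per the paper, is diffusion-based).

```python
# Illustrative sketch only (not Alibaba's code): a toy audio-to-video model
# that conditions a frame generator directly on audio features, skipping any
# 3D face model. All names, shapes, and layers here are hypothetical.
import torch
import torch.nn as nn

class AudioToFrames(nn.Module):
    def __init__(self, audio_dim=128, hidden_dim=256, frame_size=64):
        super().__init__()
        # Encode per-timestep audio features (e.g. mel-spectrogram slices).
        self.audio_encoder = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        # Decode each audio state directly into a small RGB frame.
        self.frame_decoder = nn.Sequential(
            nn.Linear(hidden_dim, frame_size * frame_size * 3),
            nn.Sigmoid(),
        )
        self.frame_size = frame_size

    def forward(self, audio_features):
        # audio_features: (batch, time, audio_dim)
        states, _ = self.audio_encoder(audio_features)
        frames = self.frame_decoder(states)          # (batch, time, H*W*3)
        b, t, _ = frames.shape
        return frames.view(b, t, 3, self.frame_size, self.frame_size)

# Roughly one second of audio features at 25 fps -> 25 video frames.
model = AudioToFrames()
dummy_audio = torch.randn(1, 25, 128)
video = model(dummy_audio)
print(video.shape)  # torch.Size([1, 25, 3, 64, 64])
```

The point of the sketch is the input/output contract: audio features in, one frame per timestep out. EMO replaces the toy decoder with a far more capable diffusion model, but the end-to-end shape of the problem is the same.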

In a research paper, the Alibaba researchers explained how they trained the model: “We constructed a vast and diverse audio-video dataset, amassing over 250 hours of footage and more than 150 million images. Furthermore, the expansive dataset encompasses a wide range of content, including speeches, film clips, and singing performances, and covers multiple languages such as Chinese and English.” The researchers said that this rich variety of speaking and singing videos ensures the training material captures a broad spectrum of human expressions and vocal styles, providing a solid foundation for the development of EMO.

“Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism,” the paper noted.

Having said that, the researchers admitted that their method has some limitations. First, it is more time-consuming compared to methods that do not rely on diffusion models. Second, since the model does not use any explicit control signals to control the character’s motion, it may result in the inadvertent generation of other body parts, such as hands, leading to artefacts in the video.

However, the results shared by the researchers are fairly accurate, and the AI tool gets the lip-sync spot on as well. It will be interesting to see whether Alibaba incorporates the tool into its products or whether it remains a research project.

Also fascinating are the little embellishments between phrases, such as pursed lips or a downward glance, that insert emotion into the pauses rather than only the moments when the lips are moving. It’s striking how EMO captures the expressions of real human faces, even in such a short demo.

According to the paper, EMO’s model relies on a large dataset of audio and video (once again, it’s unclear exactly where that footage comes from) to provide the reference points necessary to emote so realistically. And its diffusion-based approach apparently doesn’t involve an intermediate step in which 3D models perform some of the work. Instead, EMO’s model pairs a reference-attention mechanism with a separate audio-attention mechanism to produce animated characters whose facial animations match what comes across in the audio while remaining true to the facial characteristics of the provided base image.
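As a rough illustration of how those two mechanisms could be combined, here is a hypothetical denoising block that cross-attends to reference-image tokens (to preserve identity) and to audio tokens (to drive motion). The names, shapes, and structure are assumptions for illustration only and are not taken from the paper.

```python
# Hypothetical sketch of the idea described above: a block that attends to
# reference-image features (identity) and audio features (motion). This is
# not the paper's architecture; all names and dimensions are illustrative.
import torch
import torch.nn as nn

class ReferenceAudioBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_tokens, ref_tokens, audio_tokens):
        # Reference attention: keep the generated face consistent with the base image.
        x = frame_tokens + self.ref_attn(self.norm1(frame_tokens), ref_tokens, ref_tokens)[0]
        # Audio attention: let the waveform features steer lip and facial motion.
        x = x + self.audio_attn(self.norm2(x), audio_tokens, audio_tokens)[0]
        return x

block = ReferenceAudioBlock()
frame_tokens = torch.randn(1, 196, 256)   # latent tokens of the frame being generated
ref_tokens = torch.randn(1, 196, 256)     # tokens from the single portrait photo
audio_tokens = torch.randn(1, 50, 256)    # tokens from the speech/singing audio
out = block(frame_tokens, ref_tokens, audio_tokens)
print(out.shape)  # torch.Size([1, 196, 256])
```

In a diffusion-based system, a block like this would typically sit inside the denoising network and run at every denoising step, so the generated frame is repeatedly pulled toward both the reference face and the audio.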

It’s an impressive collection of demos, and after watching them, it’s impossible not to imagine what’s coming next. But if you make your money as an actor, try not to imagine too hard, because things get pretty disturbing pretty quickly.
