Deceptively real voices – from the computer
The researcher sees the most important application of his method in “virtual dubbing”, in which an actor’s lip movements are adapted to the audio track of a voice actor. The result can be admired in a video clip featuring David Beckham, in which he calls for the fight against malaria in nine different languages. Anyone who knows the soccer celebrity knows, of course, that this could not have come about without technical support: the video clip was created with software based on Face2Face. In the future, the lip movements of film actors could be adapted to the soundtrack of the dubbed version in a similar way.
With the “Neural Voice Puppetry” method, which Thies and his colleagues presented at the beginning of 2020, this even works at the push of a button in simple cases: the software analyzes any voice recording and automatically adapts the lip movements of a target person to the words. To meet the quality requirements and the high resolution of the big screen, however, the algorithms would have to be refined further.
A computer-generated actor also needs a voice. In most current deepfakes, you simply hear the original voice, or the voice comes from a human impersonator. But here too, artificial intelligence is gaining the upper hand. In the 1980s, attempts were still made to recreate speech production using mathematical models, and the result was the typically tinny sound of computer voices. Later, for announcements at train stations, developers switched to chopping up speech samples into their individual parts and recombining them sound by sound or syllable by syllable to form new words and sentences. “Today we use similar machine learning approaches for the artificial generation of speech as in image processing,” says Björn Schuller, professor at the University of Augsburg and founder of the start-up Audeering, which deals with the automatic analysis of speech. While a few years ago the big innovations took place in the audio sector because the computing power for image processing was lacking, today it is the other way around: Now the deepfake technologies from video manipulation are being adopted for voice generation.
Based on the principles of imitation and pattern recognition, the algorithms are trained on countless speech samples to form sounds. They can imitate voices or create completely new ones. The AI doesn’t just learn to recite texts in a specific person’s voice. If it hears enough recordings of happy people, it will also recognize what happiness is and can then transfer this to its own voice. The same goes for characteristics such as a speaker’s age, gender or height. “The results are now so good that people can hardly recognize them as fakes,” says Schuller. The way to a complete deepfake actor seems to have already been paved.