Turning speech into text has become so common that it’s a part of almost every smartphone. But have you ever thought about turning your speech into a portrait? Researchers have, and they’ve even made it possible.
Artificial intelligence scientists at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created an AI that turns short snippets of recorded speech into a human face. As if this weren’t both stunning and creepy enough, the results are actually fairly accurate, too!
The CSAIL researchers published a paper about their invention back in 2019. It’s an algorithm called, not surprisingly, Speech2Face, and the name says it all. In the demo, you can take a peek at how it works and what the results are. At the very top of the page, you’ll hear audio snippets of different people speaking. Their real photos are there just for reference; Speech2Face recreated each portrait based only on a three-second recording of the person’s voice.
Interestingly enough, the AI seems to work better when the audio clips are longer. The researchers have shared some examples of faces recreated from three versus six seconds of speech.
Of course, the results are far from perfect, but they’re amazing and eerily accurate all the same. Even so, the AI sometimes completely misses the point and mixes up the gender, age, or ethnicity of the subject.
Even though the algorithm was created for scientific purposes only, the question of privacy has been raised. The team claims that their method “cannot recover the true identity of a person from their voice,” that is, it cannot recreate an exact image of their face.
“This is because our model is trained to capture visual features (related to age, gender, etc.) that are common to many individuals, and only in cases where there is strong enough evidence to connect those visual features with vocal/speech attributes in the data (see ‘voice-face correlations’ below). As such, the model will only produce average-looking faces, with characteristic visual features that are correlated with the input speech. It will not produce images of specific individuals.”
However, if the algorithm became so sophisticated that it could recreate super-realistic faces, what impact could it have? The first thought that comes to my mind is that technology like this could be of immense help to police officers and detectives… Or I’m just watching too many crime TV shows. On the other hand, it could have a negative impact on YouTube and TikTok stars who want to keep their private lives away from followers, so they only do voiceovers and never appear in front of the camera. But like every technology, I guess this one could be super-useful in good hands, and dangerous in bad ones.