Microsoft's VALL-E can mimic your voice using just 3 seconds of audio

The text-to-speech AI model promises to preserve a speaker's emotional tone

Jan 10, 2023

WALL-E Pixar character with VALL-E writing — (Credit: Pixar / The Shortcut)

➡️ The Shortcut Skinny: VALL-E

🎤 Microsoft’s new text-to-speech AI will be able to simulate your voice
🗣️ It only needs three seconds of audio to work
💬 The AI could be used for text-to-speech applications
😳 Microsoft says VALL-E will even preserve your emotional tone during playback

Advances in AI continue to straddle the line between being absolutely amazing and undeniably scary, and Microsoft’s VALL-E technology is yet another artificial intelligence breakthrough that hits that unnerving sweet spot.

Microsoft says its new text-to-speech model can accurately simulate a person’s voice using only three seconds of audio. Once VALL-E has your voice locked in, it can synthesize speech from text in that voice while preserving your emotional tone.

VALL-E’s creators say the AI could be used for high-quality text-to-speech applications, speech editing, and audio content creation. Basically, don’t be surprised if we turn to AI for our writing, speaking and driving needs in the very near future.

The VALL-E website contains several samples of how the AI captures the speaker’s voice and then turns it into something that sounds much more akin to an actual human than what other text-to-speech generators can offer.

Of course, as with any AI or deepfake technology, bad actors may be rubbing their hands at the prospect of being able to convincingly edit what people say. Misinformation is already rife on the Internet, and VALL-E sounds like another potential tool for spreading it.

Microsoft is aware of these concerns, however, and noted in its research:

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”

We’ve recently seen Apple introduce its own form of text-to-speech AI, which can turn titles on Apple Books into audiobooks. The digital narration service has been soft-launched and currently works on select romance and fiction titles.

Apple says it introduced the technology to reduce the financial barrier for independent publishers who may not have the budget to create an audiobook version of their work.
