
Microsoft's VALL-E AI can simulate human voice using three-second audio samples

The text-to-speech model can synthesize human voices by listening to a three-second audio sample

January 11, 2023 / 15:04 IST
(Image Courtesy: Reuters)

Researchers from Microsoft have announced a new text-to-speech model that can simulate a person's voice by listening to a three-second audio sample.

VALL-E, as it's called, can even preserve the speaker's emotional tone. The researchers describe it as a "neural codec language model", and it is built on the foundation of Meta's EnCodec compression model, which can compress audio into file sizes ten times smaller than MP3 at 64Kbps with no apparent loss in quality.


VALL-E uses EnCodec to break the audio in a file into small chunks for analysis. Instead of working with raw waveforms, VALL-E generates discrete codec codes from text and acoustic prompts.
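To give a rough sense of what "codec codes" means here, the toy sketch below shows the general idea of quantization: frames of audio are mapped to indices in a small codebook, and those discrete indices are what a model like VALL-E predicts. This is a deliberately simplified illustration, not EnCodec's actual algorithm; the codebook values and frame data are made up for the example.

```python
# Toy illustration of the idea behind neural codec tokens (NOT EnCodec itself):
# a waveform is split into frames, and each frame is mapped to the index of
# the nearest entry in a small "codebook". A codec language model then works
# with these discrete codes instead of raw waveform samples.

def quantize(frames, codebook):
    """Map each frame value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
            for x in frames]

def dequantize(codes, codebook):
    """Reconstruct an approximate (lossy) signal from discrete codes."""
    return [codebook[c] for c in codes]

# Hypothetical 4-entry codebook and a short run of frame values.
codebook = [-0.75, -0.25, 0.25, 0.75]
frames = [0.8, 0.3, -0.2, -0.9, 0.1]

codes = quantize(frames, codebook)    # discrete tokens: [3, 2, 1, 0, 2]
approx = dequantize(codes, codebook)  # lossy reconstruction of the frames
```

Real neural codecs like EnCodec learn their codebooks and operate on high-dimensional latent vectors rather than single values, but the discrete-token output is the key property VALL-E exploits.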

It then matches the three-second sample to different conditions and environments, simulating how it thinks the voice would sound in each. To do this, the researchers trained VALL-E on more than 60,000 hours of audio from more than 7,000 speakers in Meta's LibriLight audio library.