Researchers from Microsoft have announced a new text-to-speech model that can simulate a person's voice by listening to a three-second audio sample.
VALL-E, as it's called, can even preserve the speaker's emotional tone. The researchers describe it as a "neural codec language model," and it is built on the foundation of Meta's EnCodec compression model, which can compress audio into file sizes ten times smaller than MP3 at 64 kbps with no apparent loss in quality.
VALL-E uses EnCodec to break the audio in a file into small, discrete chunks for analysis. Rather than synthesizing waveforms directly, VALL-E generates codec codes from text and acoustic prompts, then decodes those codes back into audio.
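To make "codec codes" concrete, the sketch below implements a toy residual vector quantization (RVQ), the general scheme EnCodec-style codecs use to turn continuous audio frames into a handful of small integers per frame. All sizes, codebooks, and data here are invented for illustration; this is not the actual EnCodec or VALL-E code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only -- real codecs use far larger values.
n_frames, dim = 50, 8         # 50 audio frames, 8-dim embedding per frame
n_codebooks, n_codes = 4, 16  # 4 quantizer stages, 16 entries per codebook

frames = rng.normal(size=(n_frames, dim))            # stand-in for encoder output
codebooks = rng.normal(size=(n_codebooks, n_codes, dim))

codes = np.zeros((n_frames, n_codebooks), dtype=int)
residual = frames.copy()
for q in range(n_codebooks):
    # Each stage quantizes the residual the previous stages failed to capture.
    dists = np.linalg.norm(residual[:, None, :] - codebooks[q][None, :, :], axis=-1)
    codes[:, q] = dists.argmin(axis=1)
    residual -= codebooks[q][codes[:, q]]

# Each frame is now just 4 small integers -- discrete tokens that a model
# like VALL-E can predict with a language-modeling objective.
print(codes.shape)  # (50, 4)
```

Because the audio is reduced to sequences of discrete tokens, generating speech becomes a next-token prediction problem, which is what lets VALL-E borrow techniques from text language models.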
It matches the three-second sample against different conditions and environments, simulating how the voice would sound in each. To do this, the researchers trained VALL-E on more than 60,000 hours of audio from more than 7,000 speakers in Meta's LibriLight audio library. As a result, VALL-E can also simulate what a voice would sound like in acoustic environments other than the one in the sample.
The researchers are also aware of VALL-E's potential for misuse. In the post announcing the project, they write, "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker."
To mitigate this, the researchers recommend, "a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model."