
Microsoft's VALL-E AI can simulate human voice using three-second audio samples

The text-to-speech model can synthesize human voices by listening to a three-second audio sample

January 11, 2023 / 15:04 IST
(Image Courtesy: Reuters)

Researchers from Microsoft have announced a new text-to-speech model that can simulate a person's voice by listening to a three-second audio sample.

VALL-E, as it's called, can even preserve the speaker's emotional tone. The researchers describe it as a "neural codec language model", and it is built on the foundation of Meta's EnCodec compression model, which can compress audio into file sizes ten times smaller than MP3 at 64Kbps with no apparent loss in quality.


VALL-E uses EnCodec to break the audio in a file into small chunks for analysis. Instead of working with raw waveforms, VALL-E generates discrete codec codes from text and acoustic prompts.
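To give a rough sense of what "codec codes" means here, the toy sketch below shows the general idea of quantization: frames of audio are mapped to indices in a small codebook, and those discrete indices are what a model like VALL-E predicts. This is a deliberately simplified illustration, not EnCodec's actual algorithm; the codebook values and frame data are made up for the example.

```python
# Toy illustration of the idea behind neural codec tokens (NOT EnCodec itself):
# a waveform is split into frames, and each frame is mapped to the index of
# the nearest entry in a small "codebook". A codec language model then works
# with these discrete codes instead of raw waveform samples.

def quantize(frames, codebook):
    """Map each frame value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
            for x in frames]

def dequantize(codes, codebook):
    """Reconstruct an approximate (lossy) signal from discrete codes."""
    return [codebook[c] for c in codes]

# Hypothetical 4-entry codebook and a short run of frame values.
codebook = [-0.75, -0.25, 0.25, 0.75]
frames = [0.8, 0.3, -0.2, -0.9, 0.1]

codes = quantize(frames, codebook)    # discrete tokens: [3, 2, 1, 0, 2]
approx = dequantize(codes, codebook)  # lossy reconstruction of the frames
```

Real neural codecs like EnCodec learn their codebooks and operate on high-dimensional latent vectors rather than single values, but the discrete-token output is the key property VALL-E exploits.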

It then matches the three-second sample to different conditions and environments, simulating how it thinks the voice would sound in each. To do this, the researchers trained VALL-E on more than 60,000 hours of audio from more than 7,000 speakers in Meta's LibriLight audio library.