Microsoft's new text-to-speech technology could be massive

Microsoft's neural codec language model, VALL-E, tokenises speech and uses algorithms to create waveforms that sound like the speaker while keeping their timbre and emotional tone.

February 19, 2023 / 09:52 PM IST
Microsoft VALL-E, a new text-to-speech AI

In areas such as healthcare and education, text-to-speech (TTS) AI has eased operations and helped people multitask, whether at home or on the job. Think of voice bots that screen COVID-19 patients, minimising in-person contact and reducing the burden on physicians. TTS is also an enabler: it facilitates reading and assists persons with disabilities. The best-known example is Stephen Hawking, who spoke through a synthesised voice generated by software on his computer; the late physicist's voice can now be accessed by many.

TTS is a standard assistive technology in which a computer or tablet reads the text on the screen out loud to the user. It is therefore popular among children with reading difficulties, particularly those who struggle with decoding.

TTS turns written words on a computer or digital device into sound. It is great for children who have trouble reading, and it can also help them write, edit, and even pay attention. It gives any digital content a voice, whatever the format (applications, websites, ebooks, online documents). TTS systems also provide a seamless way to listen to text on mobile devices and desktops, and they are gaining popularity because they are convenient for both personal and professional use. Microsoft has now developed a new approach to TTS.

Microsoft's VALL-E is a neural codec language model. The AI tokenises speech before using its algorithms to construct waveforms that sound like the speaker while retaining the speaker's timbre and emotional tone.
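
To make "tokenising speech" more concrete, here is a toy sketch of residual quantisation, the general idea behind neural audio codecs that use several quantisers. It is not VALL-E's actual codec: real codecs quantise vectors produced by a neural encoder, while this sketch quantises single numbers. The codebooks, sample values, and function names are all invented for illustration.

```python
# Toy residual quantisation: each stage encodes what the previous stage missed.
# Codebooks and samples are made-up values, not taken from VALL-E.

def nearest_index(value, codebook):
    """Return the index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def encode(sample, codebooks):
    """Turn one real-valued sample into integer tokens, one per stage."""
    tokens = []
    residual = sample
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        tokens.append(idx)
        residual -= codebook[idx]  # pass the leftover error to the next stage
    return tokens

def decode(tokens, codebooks):
    """Rebuild an approximate sample by summing the chosen entries."""
    return sum(codebook[idx] for idx, codebook in zip(tokens, codebooks))

CODEBOOKS = [
    [-0.5, 0.0, 0.5],    # coarse stage
    [-0.2, 0.0, 0.2],    # finer stage
    [-0.05, 0.0, 0.05],  # finest stage
]

samples = [0.62, -0.31, 0.04]
tokens = [encode(s, CODEBOOKS) for s in samples]
reconstructed = [decode(t, CODEBOOKS) for t in tokens]

for s, t, r in zip(samples, tokens, reconstructed):
    print(f"sample={s:+.2f} tokens={t} reconstructed={r:+.2f}")
```

Each sample becomes a short list of small integers. A language model can then treat such integers as "words" to predict, and a decoder turns the predicted integers back into sound.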

The research paper claims that VALL-E can produce high-quality, personalised speech using only a three-second enrolled recording of an unseen speaker as the acoustic prompt. The process requires no additional structural engineering, pre-designed acoustic features, or fine-tuning. The model also offers in-context learning, which enables prompt-based approaches to zero-shot TTS.

Existing approaches

Current TTS approaches can be classified as either cascaded or end-to-end. In 2018, researchers from Google and the University of California, Berkeley, developed cascaded TTS systems, which typically employ a pipeline consisting of an acoustic model and a vocoder, with mel spectrograms as the intermediate representation.

In 2021, Korean researchers, along with Microsoft Research Asia, proposed end-to-end TTS models that jointly optimise the acoustic model and the vocoder to address the vocoder's shortcomings. In real-world settings, however, it is desirable to adapt a TTS system to any voice from only a few enrolled recordings. Consequently, there is rising interest in zero-shot multi-speaker TTS, with most research focusing on cascaded TTS systems.

As pioneers, researchers at Baidu Research in California proposed approaches for speaker adaptation and speaker encoding. Researchers in Taiwan then applied meta-learning to speaker adaptation, which requires only five training examples to produce a well-performing system. Speaker encoding-based approaches have likewise made significant strides in recent years. Such a system consists of a speaker encoder and a TTS component, with the speaker encoder pre-trained on the speaker verification task.

Later, experiments by Google researchers in 2019 demonstrated that such a model could create high-quality output for in-domain speakers from three seconds of enrolled recordings. Similarly, Chinese researchers used advanced speaker embedding models in 2018 to improve quality for unseen speakers, although that quality still needs improvement. Compared with prior work by researchers at Zhejiang University in China, VALL-E continues the cascaded TTS tradition but uses audio codec codes as intermediate representations. It is the first TTS system to show strong in-context learning capabilities like those of GPT-3, without the need for fine-tuning, pre-designed features, or a sophisticated speaker encoder.

How does it work?

The researchers offer audio demonstrations of the model in action. For each example, one sample is a three-second audio cue called the "Speaker Prompt", which VALL-E must imitate. Two further samples are provided: the first, labelled "Baseline", is representative of standard text-to-speech synthesis, while the second, "VALL-E", is the model's output.

According to the evaluation results, VALL-E outperforms the most advanced zero-shot TTS system on both the LibriSpeech and VCTK datasets, setting state-of-the-art zero-shot TTS results on both.


VALL-E has come a long way, yet it still has the following problems, according to the researchers:

  • The study authors note that speech synthesis sometimes produces unclear, missing, or duplicated words. The main reason is that the phoneme-to-acoustic part of the language model is autoregressive: its attention alignments can become disordered, and nothing constrains the model to prevent this.

  • No amount of training data, not even 60,000 hours' worth, can account for every possible voice. This is especially true for accented speakers. And since LibriLight is an audiobook dataset, most utterances are spoken in a reading style, so the range of speaking styles needs to be broadened.

  • The researchers currently use two models to predict the codes for different quantisers. Predicting them with a single large universal model is a promising path forward.

  • Because VALL-E can synthesise speech while preserving speaker identity, the model carries risks of misuse, such as voice ID spoofing or impersonation.
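
The two-model setup mentioned in the limitations above can be pictured with a small data-flow sketch. In the VALL-E paper, one autoregressive model predicts the codes of the first quantiser frame by frame, and a second, non-autoregressive model then fills in the remaining quantisers one whole layer at a time. The stub functions below are hypothetical stand-ins that only do arithmetic; the real models are neural networks, and the constants are illustrative assumptions.

```python
# Data-flow sketch only: the "models" are arithmetic stubs, not neural networks.
NUM_QUANTISERS = 8  # illustrative assumption about the codec setup
VOCAB = 1024        # illustrative assumption: entries per codebook

def ar_next_token(phonemes, first_layer_so_far):
    """Stub autoregressive step: depends on every token generated so far."""
    history = sum(first_layer_so_far) + 7 * len(first_layer_so_far)
    return (sum(phonemes) + history) % VOCAB

def nar_layer(phonemes, previous_layers, layer_index):
    """Stub non-autoregressive step: predicts a whole layer in one call."""
    return [(tok + layer_index + len(phonemes)) % VOCAB
            for tok in previous_layers[-1]]

def generate(phonemes, num_frames):
    # Model 1: build the first quantiser's codes one frame at a time.
    first = []
    for _ in range(num_frames):
        first.append(ar_next_token(phonemes, first))
    # Model 2: fill in quantisers 2..NUM_QUANTISERS, one layer per call.
    layers = [first]
    for k in range(1, NUM_QUANTISERS):
        layers.append(nar_layer(phonemes, layers, k))
    return layers

codes = generate(phonemes=[3, 14, 15], num_frames=5)
print(len(codes), "layers of", len(codes[0]), "frames each")
```

The frame-by-frame loop in the first model is where the word-level errors described in the first bullet can creep in, since each step depends on everything generated before it.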


Neural networks and end-to-end modelling have improved voice synthesis in recent years. Cascaded text-to-speech (TTS) systems now use acoustic models and vocoders, with mel spectrograms serving as intermediate representations. Modern TTS systems can synthesise high-quality speech for a single speaker or for multiple speakers.

Furthermore, TTS technology has been incorporated into a range of software and hardware, including navigation apps, e-learning platforms, and virtual assistants such as Amazon's Alexa and Google Assistant. It is also used in advertising, marketing, and customer service to make interactions more engaging and relevant to the individual.

Nivash Jeevanandam is a senior research writer at INDIAai (Govt. of India) - National AI Portal of India | NASSCOM. Views expressed are personal.
first published: Feb 19, 2023 09:47 pm