Moneycontrol

Microsoft's new text-to-speech technology could be massive

Microsoft's neural codec language model, VALL-E, tokenises speech and uses algorithms to create waveforms that sound like the speaker while keeping their timbre and emotional tone.

February 19, 2023 / 21:52 IST
Microsoft VALL-E, a new text-to-speech AI

Text-to-speech (TTS) AI has eased operations in many areas, such as healthcare and education, and has helped people multitask at home and at work. Think of voice bots that screened COVID-19 patients, minimising in-person contact and reducing the burden on physicians. TTS is also an enabler: it can make reading easier and assist people with disabilities. Perhaps the best-known example is Stephen Hawking, who spoke through a synthesised voice generated by software on his computer; the late physicist's voice can now be accessed by many.

TTS is a standard assistive technology in which a computer or tablet reads on-screen text aloud to the user. As a result, it is popular among children with reading difficulties, particularly those who struggle with decoding.


TTS turns written words on a computer or other digital device into sound. It helps children who have trouble reading, and it can also help them write, edit and even pay attention. It can give a voice to almost any digital content, including applications, websites, e-books and online documents. TTS systems also offer a seamless way to listen to text on mobile devices and desktops, and they are gaining popularity because of the convenience they offer readers in both personal and professional settings. Microsoft has now developed a new approach to TTS.

Microsoft's VALL-E is a neural codec language model. It tokenises speech, breaking audio into discrete tokens with a neural audio codec, and then uses those tokens to generate waveforms that sound like the original speaker while retaining the speaker's timbre and emotional tone.
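To make "tokenises speech" more concrete, here is a toy sketch of the idea behind a neural audio codec. A waveform is cut into frames, each frame is replaced by a few integer token IDs, and an approximate waveform is rebuilt from those IDs. VALL-E builds on Meta's EnCodec codec, which uses residual vector quantisation with learned codebooks. In this sketch the codebooks are random, the frame size is tiny, and the function names (`make_codebooks`, `encode`, `decode`) are hypothetical. It only illustrates the tokenise-and-reconstruct step, not VALL-E's language model or how it preserves a speaker's voice.

```python
import math
import random

FRAME = 4  # samples per frame (toy value; real codecs use many more)

def make_codebooks(n_books=3, size=16, seed=0):
    """Hypothetical random codebooks standing in for a codec's learned ones."""
    rng = random.Random(seed)
    books, scale = [], 1.0
    for _ in range(n_books):
        # Entry 0 is the zero vector, so a stage can never make the residual worse.
        book = [[0.0] * FRAME]
        book += [[rng.uniform(-scale, scale) for _ in range(FRAME)] for _ in range(size - 1)]
        books.append(book)
        scale /= 2  # later stages model finer detail
    return books

def frames(signal, size=FRAME):
    """Split a waveform into fixed-size frames, zero-padding the last one."""
    padded = list(signal) + [0.0] * (-len(signal) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], vec)))

def encode(signal, codebooks):
    """Residual vector quantisation: each codebook quantises what earlier ones missed.
    Returns one list of token IDs per frame (one ID per codebook)."""
    tokens = []
    for frame in frames(signal):
        residual, ids = list(frame), []
        for cb in codebooks:
            k = nearest(cb, residual)
            ids.append(k)
            residual = [r - c for r, c in zip(residual, cb[k])]
        tokens.append(ids)
    return tokens

def decode(tokens, codebooks):
    """Rebuild a waveform by summing the selected codebook vectors for each frame."""
    out = []
    for ids in tokens:
        frame = [0.0] * FRAME
        for cb, k in zip(codebooks, ids):
            frame = [f + c for f, c in zip(frame, cb[k])]
        out.extend(frame)
    return out

def sq_error(a, b):
    """Total squared difference between two waveforms."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Usage: tokenise a short sine wave, then reconstruct it with 1 and 3 codebooks.
signal = [math.sin(2 * math.pi * n / 16) for n in range(32)]
books = make_codebooks()
tokens = encode(signal, books)
print("tokens for first two frames:", tokens[:2])
print("error with 1 codebook: ", sq_error(signal, decode(encode(signal, books[:1]), books[:1])))
print("error with 3 codebooks:", sq_error(signal, decode(tokens, books)))
```

Stacking codebooks is the key design choice here: each extra stage quantises what the earlier stages missed, so more tokens per frame can only bring the reconstruction closer. In VALL-E, a language model predicts sequences of codec tokens from the input text and a short recording of the target speaker, and the codec's decoder turns them back into audio.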