Chatbots that sound just like us

Voice cloning breakthrough will make AI assistants sound less synthetic

ABOVE OpenVoice claims it can clone any voice – and say it with emotion

When you’re talking to Siri, Alexa or the Google Assistant, you could be forgiven for thinking that they sound a bit flat. Whether it’s sharing the excitement that your dinner delivery is nearby, or breaking the bad news that thunderous showers are expected, voice assistants all sound like emotionless robots. But all that could be about to change.

OpenVoice is a new proof-of-concept AI tool that, its creators claim, can clone any voice with only a few seconds of reference audio.

What really sets it apart from previous attempts at voice cloning is its flexibility: different speech characteristics can be tweaked independently. For example, you can instruct the AI to make the voice sound happy, sad or even terrified. Or, to aim for a different regional audience, you can switch the voice's accent – perhaps making a previously American voice sound more British, Australian or South African.
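In practice, that suggests an interface where the cloned identity and the delivery are separate dials. As a rough illustration only – the names below are hypothetical placeholders, not OpenVoice's actual API – a sketch in Python might look like this:

```python
# Illustrative sketch: these class and function names are hypothetical,
# not OpenVoice's real API.
from dataclasses import dataclass

@dataclass
class VoiceStyle:
    emotion: str = "neutral"    # e.g. "happy", "sad", "terrified"
    accent: str = "american"    # e.g. "british", "australian", "south-african"

def speak(text: str, reference_wav: str, style: VoiceStyle) -> str:
    """Clone the voice heard in reference_wav and say `text` in that voice,
    rendered with the requested style. Stub only: a real system would
    return a path to the generated audio file."""
    raise NotImplementedError("illustrative stub")

# The same cloned voice, rendered in a very different style:
happy_brit = VoiceStyle(emotion="happy", accent="british")
# speak("Your dinner delivery is nearby!", "reference.wav", happy_brit)
```

The point of the sketch is the separation of concerns: the reference clip decides *who* is speaking, while the style settings decide *how* it is said.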

The hope is that the technology will improve how we interact with AI chatbots – such as those made by MyShell, the company behind OpenVoice – making them appear more lifelike. "The voice-cloning model enables very flexible content creation and the design of human-computer interfaces," said Zengyi Qin, an MIT researcher who led the project. "So, this could be a very important part in the future of artificial general intelligence.

“Language, vision and voice are [the] three most important modalities in artificial intelligence in the future,” added Qin. “There are some pretty good open-source projects in the language models and vision models [categories], but nobody could do the voice models as good as the previous two.”

Voice breaking

So how is it done? The trick is to break the “reference” clip into its different components.

“Flexibility here means after you control the voice, you have flexible control of styles like the accent, like the emotion and the intonation,” said Qin. “To enable such level of detail control is very difficult, because that will require a huge dataset to train it and a very large model to learn it.”

Luckily, Qin and his colleagues had around 100,000 audio clips to train on – and that appears to have been enough for the model to learn that level of control.

What makes OpenVoice so powerful is that the AI breaks a voice clip into these different characteristics – the accent, the emotion, the intonation – so that each one can be adjusted independently of the others.
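The project's paper describes this decoupling as a two-stage pipeline: a base speaker model controls the style of the speech, while a separate converter stamps the cloned speaker's "tone colour" onto the result. The function names below are placeholders sketching that pipeline, not the library's real calls:

```python
# Placeholder sketch of the two-stage pipeline described in the OpenVoice
# paper. All function names here are illustrative stubs.

def extract_tone_colour(reference_wav: str) -> list[float]:
    """Embed the target speaker's tone colour from a few seconds of
    reference audio (stub)."""
    raise NotImplementedError

def base_tts(text: str, emotion: str, accent: str) -> str:
    """Stage 1: speak the text in a generic base voice with the requested
    style; returns a path to intermediate audio (stub)."""
    raise NotImplementedError

def convert_tone_colour(intermediate_wav: str, tone_colour: list[float]) -> str:
    """Stage 2: re-render the intermediate audio so it carries the target
    speaker's tone colour while keeping the chosen style (stub)."""
    raise NotImplementedError

# 1. Who should speak: captured once, from a short reference clip.
# target = extract_tone_colour("reference.wav")
# 2. How it should be said: style is chosen independently of identity.
# draft = base_tts("Thunderous showers are expected.",
#                  emotion="sad", accent="british")
# 3. Combine the two: cloned identity applied to the styled speech.
# audio = convert_tone_colour(draft, target)
```

Because identity and style live in separate stages, the system can learn fine-grained style control without needing a matched recording of every speaker in every emotion and accent.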
