Prospective capabilities of VALL-E text-to-speech model create opportunities, raise ethical concerns over misuse

In a 2019 deepfake scandal, fraudsters stole over $240,000 from a British energy company using realistic voice-impersonating tools to pose as the company’s German boss.

Deepfakes, video and audio recordings with realistic human speech and facial expressions created using artificial intelligence, have become more widespread in recent years.

Microsoft recently announced the development of a new text-to-speech model called VALL-E. Computer Science teacher Matthew Rossillon said the algorithm uses tone recognition to convert short audio recordings into completely different sentences while preserving the speaker’s emotional tone.

Text-to-speech models like Apple’s Siri or Amazon’s Echo are common. But Rossillon said Neural TTS, the technique VALL-E uses, imitates speech through a multi-step training process that differs from traditional TTS methods.

“Neural TTS generates speech by training a model on a large dataset of speech samples and then fine-tuning the model on a smaller dataset of samples from the target speaker,” Rossillon said. “This is different from traditional TTS methods that most generators use which apply a rule-based approach to convert text to speech.”

Using just a three-second audio sample, VALL-E can generate new speech in the speaker’s voice while mimicking the background noise and audio quality of the original sample.

Rossillon said VALL-E raises questions about the potential benefits and dangers of its ability to closely match natural speech.

“VALL-E can be used for a wide range of applications such as accessibility, entertainment, and personalization,” Rossillon said. “However, it also raises ethical concerns, such as the potential for creating deepfake audio or impersonating someone without their consent.”

Junior Jackie He, enrolled in AP Computer Science A, said he thinks VALL-E makes spreading misinformation even easier and audio recordings less trustworthy.

“(Before), if you (had) an audio recording of someone saying something, you (could) almost guarantee that they actually did say it because it’s so hard to fake its authenticity,” He said. “But with (this) software, you can’t be sure of that anymore.”

Senior Sean Sun said he hopes VALL-E and other TTS technologies will bring more satisfaction to people who regularly interact with voice messages and call bots.

“A lot of hospital clinics call people using automated messages, (but) if the emitted messages sound more human, it brings a better experience for the people receiving the call,” Sun said. “A lot of people, when they receive an automated message that sounds clearly robotic, feel less inclined to reply compared to a direct message from a real person.”

He said VALL-E can help save time and money, and that Microsoft should continue to improve its text-to-speech synthesizer.

“It helps with saving human labor, such as for cartoon voice actors, where they don’t have to be in the studio recording every single line, but they can just record a few words and let the software do everything else,” He said. “(Microsoft) should keep going because there’s a lot of potential in it, and the benefits will outweigh the dangers.”

However, Sun said improved technology could also make scammers’ jobs easier.

“Having the ability to mimic human emotion and manipulate users into believing a person is talking can deceive many people,” Sun said. “A lot of call scam centers would abuse those automated calls, and they would make them seem more human and more believable.”

In the end, Rossillon said Microsoft must put protections in place to ensure VALL-E is used responsibly.

“It is important to consider the ethical implications of the technology and ensure that it is used for positive purposes and that appropriate safeguards are in place to prevent misuse,” Rossillon said. “It is also equally important to consider the impact on jobs and industries related to voice acting and dubbing.”

The Campanile
