Microsoft has made a notable advance in artificial intelligence (AI) with VALL-E, an AI system that can closely mimic a person’s voice from a three-second audio sample. Unlike conventional text-to-speech methods, which typically synthesize speech by manipulating waveforms, VALL-E converts the sample into discrete audio tokens and then generates new tokens for the requested text based on patterns it learned during training. As a result, the system can pick up and reproduce characteristics of a person’s voice, such as tone, pitch, and speaking style.
VALL-E builds on EnCodec, a neural audio codec from Meta that the company says compresses audio about ten times more than MP3 at 64 kbps without a loss of quality. Microsoft trained VALL-E on LibriLight, an audio library of 60,000 hours of English speech from more than 7,000 speakers, most of it drawn from audiobooks. As a result, VALL-E works best when the voice being synthesized resembles one of the voices in the training library. The large number of speakers gives the model a wide range of voices to draw on.
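To give a rough sense of scale for the token representation, the short Python sketch below estimates how many discrete tokens a three-second prompt becomes. The frame rate (75 frames per second), the number of codebooks (8), and the codebook size (1,024 entries) are assumptions taken from the published EnCodec and VALL-E papers for 24 kHz audio, not from this article, and the function names are purely illustrative.

```python
import math


def prompt_token_count(seconds, frame_rate_hz=75, codebooks=8):
    """Estimate the number of discrete codec tokens for a clip.

    Assumes (not stated in the article) a codec that emits `frame_rate_hz`
    frames per second, each frame described by `codebooks` tokens.
    """
    frames = math.ceil(seconds * frame_rate_hz)
    return frames * codebooks


def token_bitrate_bps(frame_rate_hz=75, codebooks=8, codebook_size=1024):
    """Bits per second needed to store the token stream."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * codebooks * bits_per_token


def avg_hours_per_speaker(total_hours, speakers):
    """Average amount of training audio per speaker."""
    return total_hours / speakers


if __name__ == "__main__":
    print(prompt_token_count(3.0))              # tokens in a 3-second prompt
    print(token_bitrate_bps())                  # bits per second
    # Upper bound on the average, since the library has more than 7,000 speakers.
    print(round(avg_hours_per_speaker(60_000, 7_000), 2))
```

Under these assumptions, a three-second prompt is only a couple of thousand tokens, which helps explain why the model can treat voice cloning as a sequence-prediction problem over tokens rather than over raw audio samples.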
VALL-E imitates not only the speaker’s voice but also the acoustic environment of the three-second sample. If the sample was recorded over the phone, for example, the generated speech will also sound like a phone call, and other distinctive characteristics of the recording environment are carried over as well.
Microsoft researchers are aware of the risks posed by VALL-E, such as bad actors using the technology to impersonate politicians or celebrities, or using familiar voices to deceive people into handing over personal information or money. The researchers have not made VALL-E’s code public, and their paper includes an ethics statement.
Of course, this technology raises serious concerns about the harm it could cause. Malicious actors could use it to commit various kinds of deception and fraud. For example:
- Disinformation and impersonation: VALL-E could be used to spread false information or propaganda by imitating the voices of politicians, celebrities, or other public figures. It could also be used to imitate trusted people, such as family members, friends, or officials, in order to deceive victims into disclosing sensitive information or handing over money.
- Phishing schemes: VALL-E could be used to imitate representatives of trusted institutions, such as banks or technology companies, in phone-based phishing scams, in which victims are duped into providing personal information or money.
- Audio tampering: VALL-E could also be used to alter audio recordings, for example by changing the content of speeches or news broadcasts, or to fabricate audio evidence for use in court.
- Psychological harm: Voice imitation technology used to harass or intimidate someone can inflict psychological harm. It can also cause widespread distrust of recorded audio, making it harder for people to assess the veracity of news stories and other critical information.
These are only a few of the risks posed by VALL-E and similar AI technologies. As AI advances, it is critical to address the ethical implications of emerging technologies and take steps to reduce the risks they pose to society. This might involve building AI systems that detect synthetic audio, enacting stricter regulations and policies, and educating the public about the dangers of voice imitation technology.
The researchers suggest that building models that can distinguish authentic audio from synthesized audio could mitigate the threats posed by these AI systems. It remains to be seen whether such technologies will have a net positive impact on society, or whether additional AI systems will be needed to protect people from them.