Audio Deep Fakes

Voice Cloning

We had a meeting today that I could only listen in on because I was en route. I jokingly thought I could probably record the conversation and use it for a deep fake. I had no idea how easy it could be.

It turns out there are already deep fake audio apps and sites that convert text to speech using the synthesized voices of celebrities and voice actors.

My testing was with Morty Smith of Rick and Morty fame:

Text-to-Voice Result

How does it work?

Essentially, the voice clone is produced by neural networks, which are a type of machine learning (ML) model. An ML model is a mathematical representation of a system or process that uses statistics and algorithms, fitted to training data, to make predictions from new input data. This differs from traditional programming, where the rules are written by hand as static algorithms. Many of the recent advances in "AI" and ML models build on this paper.

Data

Before any ML model can be used, it needs to be trained on data. The CorentinJ/Real-Time-Voice-Cloning project was trained on audiobook voice actors. How much data and training time is needed depends on the implementation and on what counts as "good enough" output. This is likely why sites like Resemble AI let you clone your voice for free: in exchange, they get free data. Sometimes there isn't enough data, or the data isn't good enough. Projects like kuleshov/audio-super-res and TrizeX/Audio-SuperRes aim to clean up and super-sample (up-sample) audio data, similar to NVIDIA's DLSS but for audio.

Extract Features

Features such as pitch, tempo, and spectral characteristics are extracted from the sampled audio data.
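
As a rough illustration of this step, here is a minimal Python sketch that pulls three simple features out of one frame of audio: loudness (RMS), pitch (via autocorrelation), and the spectral centroid (a basic spectral characteristic). It is not what any of the projects mentioned here actually run. The synthetic 200 Hz tone standing in for a voice, the 8 kHz sample rate, and all function names are assumptions made for the example, and tempo is left out because it needs longer stretches of audio than a single frame.

```python
import cmath
import math

def sine_wave(freq_hz, sample_rate, n_samples):
    """Synthetic stand-in for one frame of recorded voice (assumption)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def rms_energy(frame):
    """Root-mean-square amplitude, a simple loudness measure."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # How similar is the frame to itself shifted by `lag` samples?
        score = sum(frame[i] * frame[i + lag]
                    for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency, computed with a naive DFT."""
    n = len(frame)
    weighted_sum = 0.0
    magnitude_sum = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        magnitude = abs(coeff)
        weighted_sum += (k * sample_rate / n) * magnitude
        magnitude_sum += magnitude
    return weighted_sum / magnitude_sum

SAMPLE_RATE = 8000  # assumed sample rate in Hz
frame = sine_wave(200.0, SAMPLE_RATE, 400)  # 50 ms of a 200 Hz tone

rms = rms_energy(frame)
pitch = estimate_pitch(frame, SAMPLE_RATE)
centroid = spectral_centroid(frame, SAMPLE_RATE)
print(f"rms={rms:.3f} pitch={pitch:.1f} Hz centroid={centroid:.1f} Hz")
```

For the pure tone used here, both the pitch and the centroid should come out at roughly 200 Hz. Real speech is far messier, which is why real pipelines typically rely on dedicated audio libraries and richer representations such as spectrograms.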

Training

Training the model involves feeding the extracted features to the model. The model "learns" the patterns and nuances of the input audio data.
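
To make "learning" a little more concrete, here is a toy sketch of a training loop in Python. It is nowhere near a voice model: it fits a single straight line with gradient descent, and the data, learning rate, and function name are all made up for illustration. The loop has the same basic shape as real training, though: predict, measure the error, nudge the parameters, repeat.

```python
def train_linear(samples, lr=0.1, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in samples:
            error = (w * x + b) - y          # prediction minus target
            grad_w += 2 * error * x / n      # d(MSE)/dw
            grad_b += 2 * error / n          # d(MSE)/db
        # Step against the gradient to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical training data generated from the "hidden" rule y = 2x + 1.
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]

w, b = train_linear(data)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Nothing in the code states the rule y = 2x + 1; the parameters drift toward it purely from the examples. A voice cloning network does the same thing with millions of parameters and audio features instead of one line and eleven points.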

Output

The trained model is then used in conjunction with Text-to-Speech (TTS) to produce audio that sounds like the input voice. With real-time voice cloning, you feed in a few seconds of voice audio and some typed text, and it outputs the text as speech resembling the input voice. However, I noticed that the longer the typed text, the more distorted the voice reading it becomes.
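
One possible workaround for the distortion on long text (an assumption on my part, not something tested for this post) is to split the text into short sentence-sized chunks and synthesize each chunk separately. The Python sketch below only does the splitting; the function name and the 120-character limit are hypothetical, and the actual TTS call is deliberately left out.

```python
import re

def split_into_chunks(text, max_chars=120):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

example = "Hello there. This is a test! Is it working? Yes."
chunks = split_into_chunks(example, max_chars=30)
for chunk in chunks:
    print(repr(chunk))  # each chunk would be sent to the TTS step on its own
```

The resulting audio clips would then be concatenated in order. Whether this actually improves quality depends on the model being used.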

Applications

Some applications that come to mind:

  • More realistic narration for personal and business use.

Some malicious applications that come to mind:

  • Circumventing voice-based biometric authentication.

  • Pranking or fooling people over the phone or on video chat (the latter would also require deep faking the visuals).
