SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

*Paarth Neekhara1, *Shehzeen Hussain1, Rafael Valle1, Boris Ginsburg1, Rishabh Ranjan2, Shlomo Dubnov2, Farinaz Koushanfar2, Julian McAuley2

* Equal Contribution
1NVIDIA, 2University of California San Diego

Read our Paper

We present audio examples for our paper SelfVC: Voice Conversion With Iterative Refinement using Self Transformations. To perform zero-shot voice conversion, we use our synthesis model to combine the content embedding of any given source utterance with the speaker embedding of the target speaker, derived from a speaker verification model using 10 seconds of audio of the target speaker. Our synthesizer can perform voice conversion in two modes: a predictive mode, in which the pitch contour and durations are predicted by the synthesizer, and a guided mode, in which they are derived from the source utterance.
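The sketch below illustrates this inference flow under stated assumptions: the content encoder, speaker verification model, synthesizer, and vocoder are passed in as placeholder callables, and their names and interfaces are hypothetical stand-ins rather than the actual SelfVC API.

```python
import torch

def convert_voice(source_wav: torch.Tensor,
                  target_wav: torch.Tensor,
                  content_encoder,
                  speaker_encoder,
                  synthesizer,
                  vocoder) -> torch.Tensor:
    """Zero-shot voice conversion sketch: combine the content embedding of the
    source utterance with the speaker embedding of the target speaker.

    All four model arguments are hypothetical callables standing in for the
    content encoder, the speaker verification model, the mel-spectrogram
    synthesizer, and the vocoder; the real interfaces may differ.
    """
    with torch.no_grad():
        # Frame-level content representation of the source utterance.
        content = content_encoder(source_wav.unsqueeze(0))
        # Utterance-level speaker embedding from ~10 seconds of target audio.
        speaker = speaker_encoder(target_wav.unsqueeze(0))
        # Synthesize a mel-spectrogram conditioned on content and speaker
        # embeddings, then vocode it back to a waveform.
        mel = synthesizer(content, speaker)
        converted = vocoder(mel)
    return converted.squeeze(0)
```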

Zero-shot Any-to-Any Voice Conversion

For zero-shot any-to-any voice conversion, we select 10 target speakers (5 random male and 5 random female) from the test-clean subset of the LibriTTS dataset. Next, we randomly select 20 source utterances from the remaining speakers and perform voice conversion for each of the 10 target speakers. We present a few audio examples for this experiment in the table below.

Conversion Type | Source Utterance | Target Speaker | Ours - SelfVC (Predictive) | Ours - SelfVC (Guided)

Comparison Against Past Work

We present audio examples for the same pair of source and target audio using different voice conversion techniques, including our own. The source utterances and target speakers are selected from the test-clean subset of LibriTTS in the same way as described above for zero-shot any-to-any voice conversion. In this setting, we use the predictive mode for pitch and the guided mode for duration to ensure a fair comparison, since previous techniques preserve the duration of the source utterance. We produce audio examples for prior techniques using the voice conversion inference scripts provided in their respective official GitHub repositories.

Conversion Type | Source Utterance | Target Speaker | MediumVC | S3PRL-VC | YourTTS | ACE-VC | Ours - SelfVC

Cross Lingual Voice Conversion

For cross-lingual voice conversion, we use the CSS10 dataset, which contains utterances from 10 different languages (Chinese, Greek, Finnish, Spanish, Dutch, German, Japanese, French, Hungarian, Russian). We consider three voice conversion tasks: English to CSS10, CSS10 to English, and CSS10 to CSS10. We present voice-converted examples from two of our models: 1) SelfVC (LibriTTS), which is trained only on English speech from the train-clean-360 subset of LibriTTS, and 2) SelfVC (LibriTTS + CSS10), which is finetuned on both English and CSS10 speech.

Conversion Type | Source Utterance (CSS10) | Target Speaker (English) | Ours - SelfVC (LibriTTS) | Ours - SelfVC (LibriTTS + CSS10)
Conversion Type | Source Utterance (English) | Target Speaker (CSS10) | Ours - SelfVC (LibriTTS) | Ours - SelfVC (LibriTTS + CSS10)
Conversion Type | Source Utterance (CSS10) | Target Speaker (CSS10) | Ours - SelfVC (LibriTTS) | Ours - SelfVC (LibriTTS + CSS10)

Manual Prosody Control

Besides the two inference modes described above (guided and predictive), SelfVC also offers fine-grained control over the prosody of the synthesized speech. During inference, we can simply modify the pitch contour (normalized F0 contour) and the extracted durations of the source utterance to control the prosody of the synthesized speech. We present audio examples obtained by scaling the reference pitch contour and durations by a constant factor, as sketched below. This behaviour is similar to ACE-VC, except that we do not require any text transcriptions during training to extract duration targets for the duration predictor.
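As a concrete illustration, the following sketch rescales the guided inputs before synthesis. The array names, and the convention that a larger pace factor shortens durations (i.e. faster speech), are assumptions for illustration rather than the exact SelfVC interface.

```python
import numpy as np

def scale_prosody(f0_contour: np.ndarray,
                  durations: np.ndarray,
                  pitch_scale: float = 1.0,
                  pace: float = 1.0):
    """Sketch of manual prosody control by rescaling the guided inputs.

    `f0_contour` is the normalized F0 contour extracted from the source
    utterance and `durations` are the per-token durations fed to the
    synthesizer in guided mode; both names are hypothetical placeholders.
    """
    # Multiply the normalized pitch contour, e.g. 3.0 for "Higher Pitch (3x)".
    scaled_f0 = f0_contour * pitch_scale
    # Divide durations by the pace factor, so pace=1.5 yields faster speech
    # (assumed convention); keep every duration at least one frame long.
    scaled_durations = np.clip(np.round(durations / pace), 1, None).astype(int)
    return scaled_f0, scaled_durations
```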

Conversion Type | Source Utterance | Target Speaker | Same Pace (1x) | Fast Pace (1.5x) | Slow Pace (0.7x)
Conversion Type | Source Utterance | Target Speaker | Same Pitch (1x) | Higher Pitch (3x) | Lower Pitch (0.5x)

Bonus! SelfVC Finetuned on Celebrity Voices

Finally, we present audio examples from SelfVC finetuned on just a few minutes of speech data from different celebrities. Since SelfVC is trained in a text-free manner, it can be adapted to any speaker using audio data alone. For this experiment, we download audio monologues (roughly 5-10 minutes per speaker) from YouTube and finetune SelfVC (LibriTTS) on the combined data. We accompany the generated audio with lip-synced video avatars generated using One Shot Talking Face.

We also have a live demo where you can upload your own audio and convert it to any of the celebrity voices below.

Disclaimer: This demo is for academic and research purposes only. We do not own the rights to the audio or video content used in this demo.

Source Utterance | Target Speaker | SelfVC Generated Audio