Abstract: A common approach across many speech processing fields is to leverage large-scale pre-trained models by fine-tuning them on in-domain data for a particular application. However, obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation has been employed, yet for multi-speaker cases it often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We evaluate the approach by fine-tuning the Whisper ASR model on telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Results show that the proposed method significantly outperforms classical multi-speaker generation approaches that use external non-conversational speech datasets and, at least for the scenarios considered, can even be a valid alternative to in-domain data when the latter is limited to a few hours.
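As a rough illustration of the pipeline's first stage, the sketch below prompts an instruction-tuned Llama-3 model for a two-speaker dialogue and splits the output into per-speaker turns for the TTS stage. The prompt wording, model ID, and "A:"/"B:" turn format are illustrative assumptions, not the exact recipe used in the paper.

```python
# Minimal sketch of the content-generation step: prompting an instruction-tuned
# Llama-3 model for a two-speaker conversation. Prompt text and post-processing
# are illustrative assumptions, not the paper's exact recipe.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

messages = [
    {"role": "user", "content": (
        "Write a casual telephone conversation between two speakers, A and B, "
        "about planning a weekend trip. Write one 'A:' or 'B:' line per turn."
    )},
]
output = generator(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]

# Split the raw text into (speaker, utterance) turns for the TTS stage.
turns = []
for line in output.splitlines():
    line = line.strip()
    if line.startswith(("A:", "B:")):
        speaker, text = line.split(":", 1)
        turns.append((speaker, text.strip()))
```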
Two-speaker conversations generated with Llama-3.
NOTE 1: Synthesized conversations based on Fisher transcripts are not included here due to copyright restrictions.
NOTE 2: Additional Parakeet TTS examples are available at jordandarefsky.com.
TTS Model: Parakeet

Sample 1 [audio]
Sample 2 [audio]
Sample 3 [audio]
TTS Model: xTTS-v2 (speaker IDs cloned from LibriSpeech)

Sample 1 [audio]
Sample 2 [audio]
Sample 3 [audio]
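For the xTTS-v2 samples above, per-turn synthesis with speaker cloning can be reproduced roughly as follows using the Coqui TTS API; the LibriSpeech reference path, turn text, and output filename are hypothetical placeholders, and the paper's exact synthesis settings may differ.

```python
# Minimal sketch: cloning a LibriSpeech speaker with xTTS-v2 via the Coqui TTS
# API and rendering one conversation turn. Paths and text are placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A short clean utterance from the target LibriSpeech speaker serves as the
# voice-cloning reference (hypothetical path).
tts.tts_to_file(
    text="Sure, I can call you back tomorrow morning.",
    speaker_wav="LibriSpeech/train-clean-100/19/198/19-198-0001.flac",
    language="en",
    file_path="turn_A_000.wav",
)
```

The individual turn waveforms would then be combined, e.g. with pauses or controlled overlap between speakers, into a full multi-speaker conversation mixture.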