Abstract: A common approach across many speech processing fields is to leverage large-scale pre-trained models by fine-tuning them on in-domain data for a particular application. However, obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation has been employed, yet for multi-speaker cases it often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We evaluate the approach by fine-tuning the Whisper ASR model on telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Results show that the proposed method significantly outperforms classical multi-speaker generation approaches that use external non-conversational speech datasets and, at least for the scenarios considered, can even be a valid alternative to in-domain data when the latter is limited to a few hours.
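As a rough illustration of the pipeline's first stage, the sketch below prompts an instruction-tuned Llama-3 model for a two-speaker dialogue and splits the output into per-speaker turns for the TTS stage. The prompt wording, model ID, and "A:"/"B:" turn format are illustrative assumptions, not the exact recipe used in the paper.

```python
# Minimal sketch of the content-generation step: prompting an instruction-tuned
# Llama-3 model for a two-speaker conversation. Prompt text and post-processing
# are illustrative assumptions, not the paper's exact recipe.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

messages = [
    {"role": "user", "content": (
        "Write a casual telephone conversation between two speakers, A and B, "
        "about planning a weekend trip. Write one 'A:' or 'B:' line per turn."
    )},
]
output = generator(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]

# Split the raw text into (speaker, utterance) turns for the TTS stage.
turns = []
for line in output.splitlines():
    line = line.strip()
    if line.startswith(("A:", "B:")):
        speaker, text = line.split(":", 1)
        turns.append((speaker, text.strip()))
```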
Two-speaker conversations generated with Llama-3.
NOTE 1: Synthesized conversations based on Fisher transcripts are not included here due to copyright restrictions.
NOTE 2: Additional Parakeet TTS examples are available at jordandarefsky.com.
TTS Model: Parakeet

Sample 1 [audio]
Sample 2 [audio]
Sample 3 [audio]
TTS Model: xTTS-v2 (speaker IDs cloned from LibriSpeech)

Sample 1 [audio]
Sample 2 [audio]
Sample 3 [audio]
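For the xTTS-v2 samples above, per-turn synthesis with speaker cloning can be reproduced roughly as follows using the Coqui TTS API; the LibriSpeech reference path, turn text, and output filename are hypothetical placeholders, and the paper's exact synthesis settings may differ.

```python
# Minimal sketch: cloning a LibriSpeech speaker with xTTS-v2 via the Coqui TTS
# API and rendering one conversation turn. Paths and text are placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A short clean utterance from the target LibriSpeech speaker serves as the
# voice-cloning reference (hypothetical path).
tts.tts_to_file(
    text="Sure, I can call you back tomorrow morning.",
    speaker_wav="LibriSpeech/train-clean-100/19/198/19-198-0001.flac",
    language="en",
    file_path="turn_A_000.wav",
)
```

The individual turn waveforms would then be combined, e.g. with pauses or controlled overlap between speakers, into a full multi-speaker conversation mixture.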