End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation [paper]

Authors: Yoshiki Masuyama¹,², Xuankai Chang², Samuele Cornell²,³, Shinji Watanabe², Nobutaka Ono¹.

Affiliation: ¹Tokyo Metropolitan University, Japan ²Carnegie Mellon University, USA ³Università Politecnica delle Marche, Italy.

Abstract: Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end architecture by integrating dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track with a word error rate (WER) of 1.77%. While the WavLM-based strong SSLR demonstrates promising results by itself, the end-to-end integration with the weighted power minimization distortionless response beamformer, which simultaneously performs dereverberation and denoising, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.

Published in: To appear in the 2022 IEEE Spoken Language Technology Workshop (SLT 2022), scheduled for 19-22 January 2023 in Doha, Qatar.
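
The MVDR and MPDR beamformers named in the system labels on this page, and the WPD beamformer described in the abstract, share the same distortionless form w = R⁻¹d / (dᴴR⁻¹d) and differ mainly in which covariance matrix R is used: the noise covariance for MVDR, the covariance of the observed mixture for MPDR, and a power-weighted covariance over stacked current and delayed frames for WPD. The following is a minimal NumPy sketch for a single frequency bin, not the paper's implementation. The function name, the random covariance, and the random steering vector are illustrative assumptions; in the proposed system these quantities are estimated inside the network.

```python
import numpy as np

def distortionless_weights(cov, steering):
    """Return w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.

    cov      : (C, C) Hermitian positive-definite covariance matrix R
    steering : (C,) complex steering vector d
    """
    # Solve R x = d instead of forming the explicit inverse.
    numerator = np.linalg.solve(cov, steering)
    # np.vdot conjugates its first argument, giving d^H R^{-1} d.
    denominator = np.vdot(steering, numerator)
    return numerator / denominator

# Illustrative data only: a random 6-channel covariance and steering vector.
rng = np.random.default_rng(0)
num_channels = 6
a = rng.standard_normal((num_channels, num_channels)) \
    + 1j * rng.standard_normal((num_channels, num_channels))
cov = a @ a.conj().T + 1e-3 * np.eye(num_channels)
steering = rng.standard_normal(num_channels) + 1j * rng.standard_normal(num_channels)

weights = distortionless_weights(cov, steering)
# Response of the beamformer in the steering direction, w^H d.
response = np.vdot(weights, steering)
print(abs(response - 1.0) < 1e-8)
```

The final check verifies the distortionless constraint wᴴd = 1, which holds for all three beamformer variants regardless of the covariance chosen.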

CHiME-4 Examples

Eval Set Real: F05_440C020I_BUS_REAL
Eval Set Simu: F05_442C020Y_BUS_SIMU

Eval Set Real: F05_440C020I_BUS_REAL

Input Mixture (Channel 0)
Ground-truth transcript: there shouldn't be any risk to the banks in this sort of stuff said lawrence cohn a banking analyst at merrill lynch and company

IRIS + Joint Training (Channel 0)
Predicted transcript: there shouldn't be any risk to the banks in this sort ** stuff said lawrence POWER a banking analyst at MARYLAND COLLEGE .

MPDR + WavLM (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence **** * KING analyst at merrill lynch *** ******* .

MPDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence KING a banking analyst at merrill lynch *** ******* .

MVDR + WavLM (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence KING a banking analyst at merrill lynch *** company

MVDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence cohn a banking analyst at merrill lynch and company.

WPD + WavLM (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of **** said lawrence cohn a banking analyst at merrill lynch company.

WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of **** said lawrence cohn a banking analyst at merrill lynch company.
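
The word error rate quoted in the abstract is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below computes it for the ground-truth transcript and the IRIS + Joint Training prediction of this utterance. It assumes that runs of asterisks and standalone periods in the predicted transcripts are alignment placeholders rather than words, and that capitalization only marks errors; the helper names `normalize`, `word_edit_distance`, and `wer` are illustrative and are not taken from the paper's scoring pipeline.

```python
def normalize(text):
    """Lowercase, drop trailing periods, and skip '*' placeholder tokens."""
    words = []
    for token in text.lower().split():
        token = token.strip(".")
        if token and set(token) != {"*"}:
            words.append(token)
    return words

def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (sub/del/ins cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1]

def wer(ref_text, hyp_text):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = normalize(ref_text), normalize(hyp_text)
    if not ref:
        raise ValueError("reference transcript is empty")
    return word_edit_distance(ref, hyp) / len(ref)

ref = ("there shouldn't be any risk to the banks in this sort of stuff said "
       "lawrence cohn a banking analyst at merrill lynch and company")
hyp = ("there shouldn't be any risk to the banks in this sort ** stuff said "
       "lawrence POWER a banking analyst at MARYLAND COLLEGE .")

print(f"WER = {wer(ref, hyp):.2%}")
```

For this pair the sketch counts 6 errors over 24 reference words, a per-utterance WER of 25%. Corpus-level figures such as the 1.77% in the abstract are obtained by summing errors and reference words over all utterances before dividing.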

Eval Set Simu: F05_442C020Y_BUS_SIMU

Input Mixture (Channel 0)
Ground-truth transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market.

IRIS + Joint Training (Channel 0)
Predicted transcript: he also said **** THIS company for the first time IS developing GROWTH IN specifically *** the over the counter COMPANY health care ****** .

MPDR + WavLM (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market

MPDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market

MVDR + WavLM (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market

MVDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market

WPD + WavLM (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market

WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market

Clean Speech (Channel 0)
Ground-truth transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market.

REVERB Examples

Eval Set Real: t21_RealData_et_for_8ch_far_room1_A_t21c0211
Eval Set Simu: c30_SimData_et_for_8ch_far_room1_A_c30c020a

Eval Set Real: t21_RealData_et_for_8ch_far_room1_A_t21c0211

Input Mixture (Channel 0)
Ground-truth transcript: this isn't the kind of thing that will prove effective immediately one senior bank economist said .

WPD + WavLM + Joint Training (MultiIRIS) (trained on CHiME-4 data)
Predicted transcript: this isn't the kind of thing THAT will ** PROVE effective immediately **** **** one SENIOR BANK ECONOMIST said .

WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: this isn't the kind of thing that will prove effective immediately one senior bank economist said .

Eval Set Simu: c30_SimData_et_for_8ch_far_room1_A_c30c020a

Input Mixture (Channel 0)
Ground-truth transcript: in some cases there will be employment as usual.

WPD + WavLM + Joint Training (MultiIRIS) (trained on CHiME-4 data)
Predicted transcript: in some cases there will be employment as usual .

WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: in some cases there will be employment as usual.

Clean Speech (Channel 0)
Ground-truth transcript: in some cases there will be employment as usual.

References:

[1] Chang, Xuankai, et al. "End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation." arXiv preprint arXiv:2204.00540 (2022). [paper]