End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation [paper]

Authors: Yoshiki Masuyama (1,2), Xuankai Chang (2), Samuele Cornell (2,3), Shinji Watanabe (2), Nobutaka Ono (1).

Affiliation: (1) Tokyo Metropolitan University, Japan; (2) Carnegie Mellon University, USA; (3) Università Politecnica delle Marche, Italy.

Abstract: Self-supervised learning representation (SSLR) has proven highly effective for automatic speech recognition (ASR), mainly on clean speech. Recent work has pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper advances that integration to multi-channel input. We propose a novel end-to-end architecture that integrates dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track, with a word error rate (WER) of 1.77%. While the strong WavLM-based SSLR gives promising results by itself, end-to-end integration with the weighted power minimization distortionless response (WPD) beamformer, which simultaneously performs dereverberation and denoising, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.
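
For readers unfamiliar with the three beamformer names used in the examples below (MPDR, MVDR, WPD), the following is a short sketch of their textbook closed forms for a single frequency bin. The symbols are notational assumptions introduced here, not taken from this page: v is the steering vector, Φ^x and Φ^n are the spatial covariance matrices of the observation and of the noise, x̄ stacks the current frame with delayed past frames, and σ² is the time-varying power of the target. The variants actually used in the paper (for example, mask-based estimates) may differ.

```latex
% All three beamformers minimize an output power subject to the same
% distortionless constraint w^H v = 1; they differ in which power is minimized.
\begin{align}
  \mathbf{w}^{\mathrm{MPDR}}
    &= \frac{(\boldsymbol{\Phi}^{\mathrm{x}})^{-1}\mathbf{v}}
            {\mathbf{v}^{\mathsf H}(\boldsymbol{\Phi}^{\mathrm{x}})^{-1}\mathbf{v}}
    && \text{(power of the observed mixture)} \\
  \mathbf{w}^{\mathrm{MVDR}}
    &= \frac{(\boldsymbol{\Phi}^{\mathrm{n}})^{-1}\mathbf{v}}
            {\mathbf{v}^{\mathsf H}(\boldsymbol{\Phi}^{\mathrm{n}})^{-1}\mathbf{v}}
    && \text{(power of the noise only)} \\
  \bar{\mathbf{w}}^{\mathrm{WPD}}
    &= \frac{\mathbf{R}^{-1}\bar{\mathbf{v}}}
            {\bar{\mathbf{v}}^{\mathsf H}\mathbf{R}^{-1}\bar{\mathbf{v}}},
    \quad
    \mathbf{R} = \sum_{t} \frac{\bar{\mathbf{x}}_{t}\bar{\mathbf{x}}_{t}^{\mathsf H}}{\sigma_{t}^{2}},
    \quad
    \bar{\mathbf{v}} = \begin{bmatrix}\mathbf{v} \\ \mathbf{0}\end{bmatrix}
    && \text{(weighted power over current and delayed frames)}
\end{align}
```

Because the WPD filter also spans delayed frames, it can suppress late reverberation as well as noise, which is the sense in which it performs dereverberation and denoising simultaneously.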

Published in: To appear in the 2022 IEEE Spoken Language Technology Workshop (SLT 2022), scheduled for 19-22 January 2023 in Doha, Qatar.

Contents:

CHiME-4 Examples
REVERB Examples
References

CHiME-4 Examples

Eval Set Real: F05_440C020I_BUS_REAL
Eval Set Simu: F05_442C020Y_BUS_SIMU

Eval Set Real: F05_440C020I_BUS_REAL

Input Mixture (Channel 0)
Ground-truth transcript: there shouldn't be any risk to the banks in this sort of stuff said lawrence cohn a banking analyst at merrill lynch and company


IRIS + Joint Training (Channel 0)
Predicted transcript: there shouldn't be any risk to the banks in this sort ** stuff said lawrence POWER a banking analyst at MARYLAND COLLEGE .


MPDR + WavLM (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence **** * KING analyst at merrill lynch *** ******* ..


MPDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence KING a banking analyst at merrill lynch *** ******* .


MVDR + WavLM (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence KING a banking analyst at merrill lynch *** company


MVDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence cohn a banking analyst at merrill lynch and company.


WPD + WavLM (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of **** said lawrence cohn a banking analyst at merrill lynch company.


WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: there shouldn't be any risk to the banks in this sort of **** said lawrence cohn a banking analyst at merrill lynch company.
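
The transcripts above can be scored per utterance with a word error rate. The following is a minimal sketch, assuming that runs of asterisks are alignment placeholders for deleted words and that uppercase words mark substitutions (a common scoring-tool convention; this page does not define the notation). The helper names `normalize`, `word_errors`, and `wer` are hypothetical and are not part of the authors' code. Single-utterance scores like these are not comparable to the corpus-level WER quoted in the abstract.

```python
def normalize(transcript: str) -> list[str]:
    """Lowercase, strip stray punctuation, and drop asterisk placeholders."""
    words = []
    for token in transcript.lower().split():
        token = token.strip(".,")
        if token and set(token) != {"*"}:
            words.append(token)
    return words


def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Errors divided by the number of reference words."""
    ref = normalize(reference)
    return word_errors(ref, normalize(hypothesis)) / len(ref)


# Transcripts copied from the F05_440C020I_BUS_REAL example above.
REFERENCE = ("there shouldn't be any risk to the banks in this sort of stuff "
             "said lawrence cohn a banking analyst at merrill lynch and company")
HYPOTHESES = {
    "IRIS + Joint Training": (
        "there shouldn't be any risk to the banks in this sort ** stuff "
        "said lawrence POWER a banking analyst at MARYLAND COLLEGE ."),
    "WPD + WavLM + Joint Training": (
        "there shouldn't be any risk to the banks in this sort of **** "
        "said lawrence cohn a banking analyst at merrill lynch company."),
}

for system, hypothesis in HYPOTHESES.items():
    print(f"{system}: {wer(REFERENCE, hypothesis):.1%}")
```

The distance is recomputed from the normalized word sequences, so the result does not depend on how the page aligned the asterisks.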


Eval Set Simu: F05_442C020Y_BUS_SIMU

Input Mixture (Channel 0)
Ground-truth transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market.


IRIS + Joint Training (Channel 0)
Predicted transcript: he also said **** THIS company for the first time IS developing GROWTH IN specifically *** the over the counter COMPANY health care ****** .


MPDR + WavLM (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market


MPDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market


MVDR + WavLM (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market


MVDR + WavLM + Joint Training (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market


WPD + WavLM (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market


WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market


Clean Speech (Channel 0)
Ground-truth transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market.


REVERB Examples

Eval Set Real: t21_RealData_et_for_8ch_far_room1_A_t21c0211
Eval Set Simu: c30_SimData_et_for_8ch_far_room1_A_c30c020a

Eval Set Real: t21_RealData_et_for_8ch_far_room1_A_t21c0211

Input Mixture (Channel 0)
Ground-truth transcript: this isn't the kind of thing that will prove effective immediately one senior bank economist said .


WPD + WavLM + Joint Training (MultiIRIS) (trained on CHiME-4 data)
Predicted transcript: this isn't the kind of thing THAT will ** PROVE effective immediately **** **** one SENIOR BANK ECONOMIST said .


WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: this isn't the kind of thing that will prove effective immediately one senior bank economist said .


Eval Set Simu: c30_SimData_et_for_8ch_far_room1_A_c30c020a

Input Mixture (Channel 0)
Ground-truth transcript: in some cases there will be employment as usual.


WPD + WavLM + Joint Training (MultiIRIS) (trained on CHiME-4 data)
Predicted transcript: in some cases there will be employment as usual .


WPD + WavLM + Joint Training (MultiIRIS)
Predicted transcript: in some cases there will be employment as usual.


Clean Speech (Channel 0)
Ground-truth transcript: in some cases there will be employment as usual.


References:

[1] Chang, Xuankai, et al. "End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation." arXiv preprint arXiv:2204.00540 (2022). [paper]