Authors: Yoshiki Masuyama (1, 2), Xuankai Chang (2), Samuele Cornell (2, 3), Shinji Watanabe (2), Nobutaka Ono (1).
Affiliation: (1) Tokyo Metropolitan University, Japan; (2) Carnegie Mellon University, USA; (3) Università Politecnica delle Marche, Italy.
Abstract: Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end architecture by integrating dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track with a word error rate (WER) of 1.77%. While the WavLM-based strong SSLR demonstrates promising results by itself, the end-to-end integration with the weighted power minimization distortionless response beamformer, which simultaneously performs dereverberation and denoising, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.
Published in: To appear in the 2022 IEEE Spoken Language Technology Workshop (SLT 2022), scheduled for 19-22 January 2023 in Doha, Qatar.
Contents:
CHiME-4 Examples
REVERB Examples
References
CHiME-4 Examples
Eval Set Real: F05_440C020I_BUS_REAL
Input Mixture (Channel 0) Ground-truth transcript: there shouldn't be any risk to the banks in this sort of stuff said lawrence cohn a banking analyst at merrill lynch and company
IRIS + Joint Training (Channel 0) Predicted transcript: there shouldn't be any risk to the banks in this sort ** stuff said lawrence POWER a banking analyst at MARYLAND COLLEGE .
MPDR + WavLM (MultiIRIS) Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence **** * KING analyst at merrill lynch *** ******* ..
MPDR + WavLM + Joint Training (MultiIRIS) Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence KING a banking analyst at merrill lynch *** ******* .
MVDR + WavLM (MultiIRIS) Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence KING a banking analyst at merrill lynch *** company
MVDR + WavLM + Joint Training (MultiIRIS) Predicted transcript: there shouldn't be any risk to the banks in this sort of ***** said lawrence cohn a banking analyst at merrill lynch and company.
WPD + WavLM (MultiIRIS) Predicted transcript: there shouldn't be any risk to the banks in this sort of **** said lawrence cohn a banking analyst at merrill lynch company.
WPD + WavLM + Joint Training (MultiIRIS) Predicted transcript: there shouldn't be any risk to the banks in this sort of **** said lawrence cohn a banking analyst at merrill lynch company.
Eval Set Simu: F05_442C020Y_BUS_SIMU
Input Mixture (Channel 0) Ground-truth transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market.
IRIS + Joint Training (Channel 0) Predicted transcript: he also said **** THIS company for the first time IS developing GROWTH IN specifically *** the over the counter COMPANY health care ****** .
MPDR + WavLM (MultiIRIS) Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market
MPDR + WavLM + Joint Training (MultiIRIS) Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market
MVDR + WavLM (MultiIRIS) Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market
MVDR + WavLM + Joint Training (MultiIRIS) Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market
WPD + WavLM (MultiIRIS) Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market
WPD + WavLM + Joint Training (MultiIRIS) Predicted transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market
Clean Speech (Channel 0) Ground-truth transcript: he also said that the company for the first time was developing drugs specifically for the over the counter consumer health care market.
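The abstract reports results as word error rate (WER), and the transcripts above can be scored the same way by hand. The following is a minimal sketch, not the scoring tool used by the authors: it assumes that runs of asterisks in the predicted transcripts are alignment placeholders for missing words, that capitalization only highlights misrecognized words, and that stray periods carry no content. The function names `normalize` and `word_error_rate` are illustrative.

```python
def normalize(text):
    """Lowercase the text and drop alignment placeholders and stray periods.

    Assumption: tokens made only of '*' are placeholders, not spoken words.
    """
    words = []
    for tok in text.lower().split():
        tok = tok.strip(".")
        if tok and set(tok) != {"*"}:
            words.append(tok)
    return words


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Real-data utterance F05_440C020I_BUS_REAL: ground truth vs. the
# single-channel IRIS + Joint Training prediction listed above.
reference = ("there shouldn't be any risk to the banks in this sort of stuff "
             "said lawrence cohn a banking analyst at merrill lynch and company")
hypothesis = ("there shouldn't be any risk to the banks in this sort ** stuff "
              "said lawrence POWER a banking analyst at MARYLAND COLLEGE .")

wer = word_error_rate(reference, hypothesis)
print(f"{wer:.2%}")  # 6 errors over 24 reference words
```

Under these assumptions the single-channel prediction for this one utterance scores 6 word errors against 24 reference words. This is a per-utterance figure and is not comparable to the corpus-level WER quoted in the abstract.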
REVERB Examples
Eval Set Real: t21_RealData_et_for_8ch_far_room1_A_t21c0211
Input Mixture (Channel 0) Ground-truth transcript: this isn't the kind of thing that will prove effective immediately one senior bank economist said .
WPD + WavLM + Joint Training (MultiIRIS) (trained on CHiME-4 data) Predicted transcript: this isn't the kind of thing THAT will ** PROVE effective immediately **** **** one SENIOR BANK ECONOMIST said .
WPD + WavLM + Joint Training (MultiIRIS) Predicted transcript: this isn't the kind of thing that will prove effective immediately one senior bank economist said .
Eval Set Simu: c30_SimData_et_for_8ch_far_room1_A_c30c020a
Input Mixture (Channel 0) Ground-truth transcript: in some cases there will be employment as usual.
WPD + WavLM + Joint Training (MultiIRIS) (trained on CHiME-4 data) Predicted transcript: in some cases there will be employment as usual .
WPD + WavLM + Joint Training (MultiIRIS) Predicted transcript: in some cases there will be employment as usual.
Clean Speech (Channel 0) Ground-truth transcript: in some cases there will be employment as usual.
References:
[1] Chang, Xuankai, et al. "End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation." arXiv preprint arXiv:2204.00540 (2022).