Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process.
Date
2021
Abstract
An End-Of-Turn Detection Module (EOTD-M) is an essential component of automatic Spoken Dialogue Systems. The capability of correctly detecting whether a user’s utterance has ended or not improves the accuracy in interpreting the meaning of the message and decreases the latency in the answer. Usually, in dialogue systems, an EOTD-M is coupled with an Automatic Speech Recognition Module (ASR-M) to transmit complete utterances to the Natural Language Understanding unit. Mistakes in the ASR-M transcription can have a strong effect on the performance of the EOTD-M. The actual extent of this effect depends on the particular combination of ASR-M transcription errors and the sentence featurization techniques implemented as part of the EOTD-M. In this paper we investigate this important relationship for an EOTD-M based on semantic information and particular characteristics of the speakers (speech profiles). We introduce an Automatic Speech Recognition Simulator (ASR-SIM) that models different types of semantic mistakes in the ASR-M transcription as well as different speech profiles. We use the simulator to evaluate the sensitivity to ASR-M mistakes of a Long Short-Term Memory network classifier trained for EOTD with different featurization techniques. Our experiments reveal the different ways in which the performance of the model is influenced by the ASR-M errors. We corroborate that not only is the ASR-SIM useful to estimate the performance of an EOTD-M in customized noisy scenarios, but it can also be used to generate training datasets with the expected error rates of real working conditions, which leads to better performance.
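To make the idea of simulating transcription errors concrete, the following is a minimal sketch of word-level noise injection at a chosen error rate. It is not the ASR-SIM described in the abstract: the function name `corrupt_transcript`, the uniform choice among substitution, deletion, and insertion, and the toy confusion vocabulary are all assumptions made for illustration, and the sketch does not model semantic mistakes or speech profiles.

```python
import random


def corrupt_transcript(words, error_rate, vocabulary, seed=None):
    """Return a noisy copy of `words` (hypothetical helper, not the paper's ASR-SIM).

    Each word is corrupted independently with probability `error_rate`.
    A corrupted word is substituted, deleted, or followed by an inserted
    word, with the three error types chosen uniformly (an assumption).
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    noisy = []
    for word in words:
        if rng.random() >= error_rate:
            noisy.append(word)  # word survives unchanged
            continue
        kind = rng.choice(("substitute", "delete", "insert"))
        if kind == "substitute":
            noisy.append(rng.choice(vocabulary))
        elif kind == "insert":
            noisy.append(word)
            noisy.append(rng.choice(vocabulary))
        # "delete": the word is dropped, so nothing is appended
    return noisy


if __name__ == "__main__":
    utterance = "is the meeting at three".split()
    confusions = ["tree", "meat", "a", "free"]  # toy vocabulary (assumption)
    print(corrupt_transcript(utterance, 0.3, confusions, seed=0))
```

A generator of this kind could be used in both ways the abstract mentions: to corrupt a clean test set at several error rates and observe how a classifier's score changes, or to corrupt training utterances at the error rate expected in deployment.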