Masked-speech Recognition Using Human and Synthetic Cloned Speech.
Lauren Calandruccio, Mohsen Hariri, Emily Buss, Vipin Chaudhary
Abstract
Open AccessVoice cloning is used to generate synthetic speech that mimics vocal characteristics of human talkers. This experiment used voice cloning to compare human and synthetic speech for intelligibility, human-likeness, and perceptual similarity, all tested in young adults with normal hearing. Masked-sentence recognition was evaluated using speech produced by five human talkers and their synthetically generated voice clones presented in speech-shaped noise at -6 dB signal-to-noise ratio. There were two types of sentences: semantically meaningful and nonsense. Human and automatic speech recognition scoring was used to evaluate performance. Participants were asked to rate human-likeness and determine whether pairs of sentences were produced by the same versus different people. As expected, sentence-recognition scores were worse for nonsense sentences compared to meaningful sentences, but they were similar for speech produced by human talkers and voice clones. Human-likeness scores were also similar for speech produced by human talkers and their voice clones. Participants were very good at identifying differences between voices but were less accurate at distinguishing between human/clone pairs, often leaning towards thinking they were produced by the same person. Reliability scoring by automatic speech recognition agreed with human reliability scoring for 98% of keywords and was minimally dependent on the context of the target sentences. Results provide preliminary support for the use of voice clones when evaluating the recognition of human and synthetic speech. More generally, voice synthesis and automatic speech recognition are promising tools for evaluating speech recognition in human listeners.