Saying “hello” to answer a phone call and “hello” to a pet cat sound very different coming from the same person. How, then, can machines interpret what speakers mean when what’s said can vary under different conditions? Three UCLA faculty members are looking at this problem. Abeer Alwan, a professor of electrical and computer engineering who leads the Speech Processing and Auditory Perception Laboratory, explores how humans produce speech, how they perceive it, and how that knowledge can be applied to systems that process speech. For many years, she has worked with two longtime collaborators: Patricia Keating, a professor of linguistics and director of the Phonetics Lab, and Jody Kreiman, a professor-in-residence in the Department of Head and Neck Surgery and co-director of the Bureau of Glottal Affairs. The trio recently received a National Science Foundation grant to study variability in human speech and how it affects speaker identification. In other words, across different conditions, from reading prepared text, to talking with a close friend, to talking excitedly to a pet, they will investigate how humans deal with variability in voice quality and how machines can better identify speakers under a wide range of conditions.

Two spectrograms of a person saying “Go Bruins.” The one on the left is spoken in a ‘regular’ way, while the one on the right is spoken with great excitement. A spectrogram is a time-frequency display: time is on the x-axis, frequency is on the y-axis, and darkness corresponds to acoustic energy.
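The caption’s description of a spectrogram can be made concrete with a short sketch. This is not code from the researchers’ project; it is a minimal illustration, assuming only numpy, of how a magnitude spectrogram is computed with a short-time Fourier transform (the function name `spectrogram` and the frame/hop sizes are illustrative choices):

```python
import numpy as np

def spectrogram(x, fs, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns (times, freqs, S), where S[f, t] is the magnitude of
    frequency bin f in frame t. In a plotted spectrogram, darker
    cells correspond to larger values of S.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    S = np.abs(np.fft.rfft(frames, axis=1)).T        # (freq bins, frames)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)   # y-axis: frequency (Hz)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs  # x-axis: time (s)
    return times, freqs, S

# Example: a 440 Hz tone sampled at 8 kHz concentrates energy
# near the frequency bin closest to 440 Hz.
fs = 8000
t = np.arange(fs) / fs                   # 1 second of audio
x = np.sin(2 * np.pi * 440 * t)
times, freqs, S = spectrogram(x, fs)
peak_bin = S.mean(axis=1).argmax()
print(freqs[peak_bin])                   # within one bin width of 440 Hz
```

With a 256-sample frame at 8 kHz, the frequency resolution is 8000/256 = 31.25 Hz per bin, which is why the peak lands near, not exactly at, 440 Hz.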

What started your interests in speech processing?

My interest in speech processing started when I was a graduate student in EECS at MIT. While visiting research labs, I was intrigued that the Speech Communication Lab there had not only electrical engineers and computer scientists but also linguists, psychologists, and speech pathologists. At that time, in the mid-1980s, it was unusual to see such a multidisciplinary approach in engineering. Today it is quite common and appreciated!

The first things that might come to someone’s mind today regarding speech processing are digital assistants like Alexa, Siri, or Google. How “smart” are these systems, and what are the current bottlenecks?

Speech technology has come a long way, and now almost anyone can access a machine that recognizes speech (automatic speech recognition, or ASR) and/or talks (speech synthesis). The technology has important and far-reaching applications, including educational tools, assistive technology for people with disabilities, diagnostic tools for certain speech disorders or even mental illness, and many others. The main bottlenecks of ASR are limited data, mismatched training and test data, and variability. Because these systems are statistical in nature, they need a significant amount of data to produce reliable results; hence the limited-data problem. In addition, if these systems are trained on speech recorded in quiet conditions and then tested in realistically noisy conditions, for example, in a moving car with the windows open and the radio on, their performance degrades considerably (the mismatch problem). Moreover, the same person can speak very differently depending on her/his emotional or health status (the variability problem).
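The train/test mismatch described above is often studied by corrupting clean recordings with noise at a controlled signal-to-noise ratio (SNR). The sketch below is not from the interviewees’ work; it is a minimal illustration, assuming only numpy, of mixing white noise into a clean signal at a target SNR (the helper name `add_noise` and the signal are hypothetical stand-ins for real recordings):

```python
import numpy as np

def add_noise(clean, snr_db, rng=None):
    """Mix white Gaussian noise into a clean signal at a target SNR (dB),
    simulating the mismatch between quiet training and noisy test data."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(len(clean))
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_signal / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 200 * t)   # stand-in for a clean utterance
noisy = add_noise(clean, snr_db=5.0)  # a fairly noisy condition

# Verify the achieved SNR matches the target.
achieved = 10 * np.log10(np.mean(clean**2) / np.mean((noisy - clean)**2))
print(round(achieved, 1))             # 5.0
```

A system trained only on `clean`-style inputs and evaluated on `noisy`-style inputs sees exactly the kind of distribution shift the mismatch problem refers to.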

On this new NSF grant you’ll again be working with Patricia Keating and Jody Kreiman. What part of the research will each of you focus on, and together what are you hoping to find out and/or develop from it?

L to R: Abeer Alwan, Jody Kreiman and Patricia Keating.

The new NSF grant is a very challenging and exciting one. It focuses on human and machine recognition of speaker identity in difficult situations, such as when the recorded speech is short (less than 30 seconds) and when the speaking style varies. Humans outperform machines in these situations. We will conduct human perceptual experiments, under the guidance of Professors Keating and Kreiman, and compare the results with machine performance (in my lab and in collaboration with researchers at Johns Hopkins University). The human and machine speaker-identity experiments will use data from speakers reading a paragraph, talking to someone they don’t know and then to someone they know well, and speaking with excitement, such as when they talk to pets (kittens and puppies in our case!). The three of us have been working together for more than 10 years. Professor Keating is a phonetician, Professor Kreiman specializes in the study of voice, and I am an engineer. Together, we are interested in understanding and modeling how humans perceive speaker identity and in comparing that to automated systems.