Whether you’re describing the sound of a broken-down car engine or the meow of a neighborhood cat, imitating a sound with your voice is a useful way to communicate a concept when words alone aren’t enough.
Vocal imitation is the auditory equivalent of sketching a quick picture to communicate something you saw. However, instead of using a pencil to draw an image, you use your vocal tract to express a sound. This may seem difficult, but it’s something we all do intuitively: to experience it for yourself, try using your own voice to mirror the sound of an ambulance siren, a crow cawing, or a bell being struck.
Inspired by the cognitive science of how we communicate, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system that can produce human-like vocal imitations without any training, and without ever having “heard” a human vocal impression before.
To achieve this, the researchers designed a system that produces and interprets sound in much the same way humans do. They began by building a model of the human vocal tract that simulates how vibrations from the larynx are shaped by the throat, tongue, and lips. They then used a cognitively inspired AI algorithm to control this vocal tract model, producing imitations that account for the context-specific ways humans choose to communicate sound.
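For readers who want a concrete picture of what a vocal tract model can look like, here is a minimal source-filter sketch in Python. It is not the CSAIL team’s model: the filter design, formant values, and function names are illustrative assumptions, meant only to show the general idea of a larynx-like excitation being shaped by resonances standing in for the throat, tongue, and lips.

import numpy as np
from scipy.signal import lfilter

SAMPLE_RATE = 16000  # samples per second

def glottal_source(pitch_hz, duration_s):
    # Crude stand-in for larynx vibration: a sawtooth pulse train at the given pitch.
    t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return 2.0 * ((t * pitch_hz) % 1.0) - 1.0

def formant_resonator(signal, center_hz, bandwidth_hz):
    # Second-order resonator approximating one vocal-tract formant
    # (a resonance shaped by the throat, tongue, and lips).
    r = np.exp(-np.pi * bandwidth_hz / SAMPLE_RATE)
    theta = 2.0 * np.pi * center_hz / SAMPLE_RATE
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    b = [1.0 - r]
    return lfilter(b, a, signal)

def synthesize(pitch_hz, formants, duration_s=0.5):
    # Shape the glottal source with a cascade of formant resonators.
    audio = glottal_source(pitch_hz, duration_s)
    for center_hz, bandwidth_hz in formants:
        audio = formant_resonator(audio, center_hz, bandwidth_hz)
    return audio / np.max(np.abs(audio))

# Example: a rough /a/-like vowel (formant values are ballpark figures, not tuned).
vowel_a = synthesize(pitch_hz=120, formants=[(730, 90), (1090, 110), (2440, 160)])

Controlling a synthesizer like this means choosing parameters such as pitch, formant positions, and timing, which is the role the cognitively inspired algorithm plays in the researchers’ system.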
The model can effectively take a wide variety of sounds from the world, such as rustling leaves, a snake’s hiss, or an approaching ambulance siren, and produce a human-like vocal imitation of each. It can also run in reverse to predict real-world sounds from imitative human speech, much as some computer vision systems recover high-quality images from sketches. For example, the model can correctly tell whether a person is imitating a cat’s “meow” or its “hiss.”
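To make the “running in reverse” idea concrete, here is a hedged sketch of how one might guess which real-world sound a vocal imitation refers to by comparing coarse spectral features against a small set of candidates. The feature choice, distance measure, and names (spectral_envelope, guess_source, the clip variables) are assumptions for illustration, not the method used in the paper.

import numpy as np

def spectral_envelope(audio, n_bands=64):
    # Coarse log-magnitude spectrum: a crude proxy for the perceptual features
    # a listener might use to compare two sounds.
    spectrum = np.abs(np.fft.rfft(audio))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([band.mean() for band in bands]))

def guess_source(imitation, candidates):
    # Return the name of the candidate sound whose envelope is closest to the imitation's.
    target = spectral_envelope(imitation)
    distances = {
        name: np.linalg.norm(spectral_envelope(clip) - target)
        for name, clip in candidates.items()
    }
    return min(distances, key=distances.get)

# Hypothetical usage, given NumPy arrays of audio samples:
# guess_source(human_meow_imitation, {"cat meow": meow_clip, "cat hiss": hiss_clip})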
In the future, the model could lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even ways to help students learn new languages.
The paper’s co-lead authors, MIT CSAIL doctoral students Kartik Chandra SM ’23 and Karima Ma and undergraduate researcher Matthew Caren, note that computer graphics researchers have long recognized that realism is rarely the end goal of visual representation: an abstract painting or a child’s crayon drawing, for example, can be just as expressive as a photograph.
“Over the past few decades, advances in sketching algorithms have led to new tools for artists, advances in AI and computer vision, and a deeper understanding of human cognition,” Chandra points out. “Just as a sketch is an abstract, non-photorealistic representation of an image rather than a photograph, our method captures the abstract ways people express the sounds they hear, rather than reproducing those sounds exactly. This teaches us about the process of auditory abstraction.”

“The goal of this project is to understand and computationally model the process of vocal imitation, which we think of as the auditory equivalent of sketching in the visual domain,” Caren says.
The art of imitation, in three parts
The research team developed three increasingly refined versions of their model to compare against human vocal imitations. First, they created a baseline model whose sole aim was to generate imitations that were as acoustically close to real-world sounds as possible, but this model did not match human behavior very well.
The researchers then designed a second “communicative” model, which Caren says focuses on what is distinctive about a sound to a listener. For example, you would likely imitate a motorboat by mimicking the rumble of its engine, because that is its most recognizable auditory feature, even if it isn’t the loudest part of the sound (compared to, say, the splashing of water). This second model produced better imitations than the baseline, but the team wanted to improve it further.
To take this method one step further, the researchers added a final layer of reasoning to their model. “Vocal imitations can sound different depending on how hard you try. It takes time and energy to get them perfectly right,” Chandra says. The researchers’ full model accounts for this by trying to avoid utterances that are very rapid, loud, or high- or low-pitched, which people tend not to use in conversation. The result is more human-like imitations that better match the decisions humans make when imitating the same sounds.
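The three stages above can be read as successively richer scoring functions for choosing an imitation. The sketch below, with made-up weights and feature representations, only illustrates how an acoustic-match term, a communicative-distinctiveness term, and an effort penalty might be combined; it is not the paper’s actual formulation.

import numpy as np

def acoustic_distance(imitation, target):
    # Baseline model: how far the imitation's features are from the target sound's.
    return np.linalg.norm(imitation - target)

def distinctiveness(imitation, target, distractors):
    # Communicative model: reward matching what makes the target stand out from
    # similar sounds (the engine rumble rather than the water splashing).
    gap_to_target = acoustic_distance(imitation, target)
    gap_to_distractors = min(acoustic_distance(imitation, d) for d in distractors)
    return gap_to_distractors - gap_to_target  # higher means more distinctive

def effort(speed, loudness, pitch_deviation):
    # Effort term: penalize utterances that are very rapid, loud, or far from a
    # comfortable pitch, which speakers tend to avoid in conversation.
    return speed ** 2 + loudness ** 2 + pitch_deviation ** 2

def full_model_score(imitation, target, distractors, controls, w_effort=0.1):
    # Full model: trade off communicating distinctively against articulatory effort.
    return distinctiveness(imitation, target, distractors) - w_effort * effort(*controls)

In this framing, the baseline corresponds to minimizing acoustic_distance alone, the second model to maximizing distinctiveness, and the full model to the combined score.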
After building this model, the team ran behavioral experiments to see whether human judges rated the AI-generated or human-generated vocal imitations as better. Notably, participants favored the AI model 25 percent of the time overall, rising to 75 percent for the motorboat imitation and 50 percent for the gunshot imitation.
Towards more expressive sound technology
Passionate about technology’s role in music and art, Caren believes this model could help artists better communicate sounds to computational systems, help filmmakers and other content creators generate AI sounds that are more nuanced to a specific context, and even let musicians quickly search sound databases by imitating a noise that is hard to describe in, say, a text prompt.
Meanwhile, Caren, Chandra, and Ma are investigating the implications of their model in other domains, including the development of language, how infants learn to talk, and even the imitative behavior of birds such as parrots and songbirds.
The team still has some kinks to work out in the current version of the model: it struggles with certain consonants, like “z,” which leads to inaccurate impressions of some sounds, such as a bee buzzing. It also can’t yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Language is full of onomatopoeia: words that mimic, but don’t exactly replicate, the things they describe, like the word “meow,” which only loosely approximates the sound cats actually make, says Robert Hawkins, a professor of linguistics at Stanford University. “The journey from real cat sounds to a word like ‘meow’ reveals a lot about the complex interplay between physiology, social inference, and communication in the evolution of language,” says Hawkins, who was not involved in the CSAIL study. “This model is an exciting step toward formalizing and testing theories about such processes, and shows that both the physical constraints imposed by the human vocal apparatus and the social pressures of communication are necessary to explain the distribution of vocal imitations.”
Caren, Chandra, and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, an associate professor in the MIT Department of Electrical Engineering and Computer Science, and Joshua Tenenbaum, an MIT professor of brain and cognitive sciences and a member of the Center for Brains, Minds, and Machines. Their research was supported, in part, by the Hertz Foundation and the National Science Foundation. The paper was presented at SIGGRAPH Asia in early December.