
Baidu Announces Breakthrough In Speech Recognition, Claiming To Top Google And Apple


When artificial-intelligence guru Andrew Ng joined Chinese Internet pioneer Baidu last May as chief scientist, he was a little cagey about what he and his team might work on at a newly opened lab in Sunnyvale, Calif. But he couldn't help revealing better speech recognition as a key area of interest in the age of the smartphone.

Today, Baidu, often called China's Google, unveiled the first results of what the former Google researcher, Stanford professor and Coursera cofounder had in mind. In a paper published today on Cornell University Library's arXiv.org site, Ng and 10 members of his Baidu Research team led by research scientist Awni Hannun said they've come up with a new method of more accurately recognizing speech, an increasingly important feature used in Apple's Siri and Dictation services as well as Google's voice search. Baidu's Deep Speech beat other methods, such as those offered by Google and Apple, on standard benchmarks that measure the error rate of speech recognition systems, according to Ng.

In particular, Deep Speech works better than the others in noisy environments, such as in a car or a crowd. That's key, of course, to making speech recognition truly useful in the real world. In noisy backgrounds, Ng said, tests showed that Deep Speech outperformed several speech systems--the Google Speech API, wit.ai, Microsoft's Bing Speech, and Apple Dictation--by over 10% in terms of word error rates.

Baidu offered supporting comments from two university professors. "This recent work by Baidu Research has the potential to disrupt how speech recognition will be performed in the future," Ian Lane, assistant research professor of engineering at Carnegie Mellon University, said in a press release. The company requested that the details not be revealed before this morning's publication of the paper, so Google, Apple, and others couldn't be contacted for comment. I'll add what they have to say if they choose to comment later.

Like other speech recognition systems, Baidu's is based on a branch of AI called deep learning. The software attempts to mimic, in very primitive form, the activity in layers of neurons in the neocortex, the roughly 80 percent of the brain where thinking occurs. Deep learning systems learn to recognize patterns in digital representations of sounds, images, and other data--ideally lots and lots of data. "The first generation of deep learning speech recognition was reaching limits," Ng said in an interview.
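The layered idea can be sketched in miniature: each "layer of neurons" applies a weighted sum and a simple nonlinearity to the output of the layer below, and stacking layers lets the network build up progressively more abstract patterns. The toy network below (random weights, hypothetical sizes; purely an illustration, not Baidu's architecture) shows the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)

def layer(x, w, b):
    """One 'layer of neurons': weighted sum followed by a nonlinearity (ReLU)."""
    return np.maximum(0.0, w @ x + b)

# A toy three-layer network: raw input -> two hidden layers -> class scores.
x = rng.standard_normal(100)                        # e.g. one frame of audio features
w1, b1 = rng.standard_normal((64, 100)) * 0.1, np.zeros(64)
w2, b2 = rng.standard_normal((32, 64)) * 0.1, np.zeros(32)
w3, b3 = rng.standard_normal((10, 32)) * 0.1, np.zeros(10)

# Each layer transforms the previous layer's output; the final weighted sum
# produces one score per output class.
scores = w3 @ layer(layer(x, w1, b1), w2, b2) + b3
```

In real systems the weights are learned from data rather than drawn at random, which is where the "lots and lots of data" comes in.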

The Baidu team collected some 7,000 hours of speech from 9,600 people, mostly in quiet environments--though sometimes speakers wore headphones playing loud background noise so they would change their pitch or inflections the same way they would in a noisy environment. Then, using a principle of physics called superposition, the team added about 15 types of noise, such as ambient noise in restaurants, cars, and subways, to those speech samples. That essentially expanded the speech samples to 100,000 hours of data. Then it let the system learn to recognize speech even amid all that noise.
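The superposition step amounts to adding a scaled noise waveform to a clean speech waveform, sample by sample; varying the noise clip and the mixing level turns one recording into many. A minimal sketch (synthetic signals and an assumed target signal-to-noise ratio; not Baidu's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(clean, noise, snr_db):
    """Mix a noise clip into a clean clip at a target signal-to-noise ratio (dB)."""
    speech_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so the mix hits the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise  # superposition: the signals simply add

clean = rng.standard_normal(16000)   # stand-in for 1 s of speech at 16 kHz
noise = rng.standard_normal(16000)   # stand-in for 1 s of restaurant noise
noisy = augment(clean, noise, snr_db=10)
```

Repeating this across roughly 15 noise types and many mixing levels is how 7,000 hours of recordings can be stretched to 100,000 hours of training data.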

It's a much simpler method than today's speech recognition systems, Ng says. Those systems use a pipeline of hand-designed modules that analyze phonemes and other components of speech, often built on statistical models called hidden Markov models, which require lots of human tuning to handle noise and speaker variation. Baidu's system replaces that pipeline with deep learning algorithms trained on a recurrent neural network, or simulation of connected neurons, making the system much simpler, Ng says.
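The end-to-end idea can be illustrated with a toy recurrent network that maps audio feature frames directly to per-frame character probabilities, with no phoneme dictionary or hidden Markov model in between. Everything here--the dimensions, the random weights, the 29-symbol character set--is a hypothetical sketch, not Deep Speech itself:

```python
import numpy as np

rng = np.random.default_rng(1)

n_feats, n_hidden, n_chars = 80, 128, 29  # spectrogram bins, hidden units, characters
Wx = rng.standard_normal((n_hidden, n_feats)) * 0.01
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.01
Wy = rng.standard_normal((n_chars, n_hidden)) * 0.01

def forward(frames):
    """Map a sequence of audio feature frames to per-frame character probabilities."""
    h = np.zeros(n_hidden)
    probs = []
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)       # recurrent state carries context forward
        logits = Wy @ h
        e = np.exp(logits - logits.max())  # softmax over the character set
        probs.append(e / e.sum())
    return np.array(probs)

frames = rng.standard_normal((50, n_feats))  # 50 frames of fake spectrogram
char_probs = forward(frames)
```

Training adjusts the three weight matrices from transcribed audio, so the noise and speaker variation that HMM pipelines model by hand are instead absorbed directly from data.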

What really allowed this method to work, however, is a powerful new computer system that uses many graphics processing units made by the likes of chipmaker Nvidia. GPUs are used for accelerating graphics in personal computers. Hooked up in parallel, they can train speech recognition models much more quickly and economically than standard computer processors--roughly 40 times faster than the systems Ng used in his work at Stanford and Google. "The algorithms are important, but a large part of why this works is the scalability," he says--both in the computer system and the volume of data it can process.
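The payoff of hooking processors up in parallel can be simulated even on a CPU: split a training batch across workers, have each compute the gradient on its own shard, and average the results. With equal-sized shards the average matches the full-batch gradient exactly, so adding workers cuts wall-clock time without changing the update. The sketch below uses a simple least-squares loss and simulated "devices" (an illustration of data parallelism in general, not Baidu's GPU system):

```python
import numpy as np

rng = np.random.default_rng(7)

w = rng.standard_normal(5)                 # current model parameters
X = rng.standard_normal((1000, 5))         # one training batch
y = X @ rng.standard_normal(5)             # synthetic targets

def grad(Xs, ys, w):
    """Gradient of mean squared error on one shard of the batch."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

full = grad(X, y, w)                                  # one worker, whole batch
shards = np.array_split(np.arange(1000), 4)           # 4 simulated devices
parallel = np.mean([grad(X[i], y[i], w) for i in shards], axis=0)
```

In practice each shard's gradient would be computed on a separate GPU at the same time, which is where the roughly 40x speedup comes from.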

Without that kind of speed, trying to crunch all that data wouldn't be practical. He says it's more sophisticated than other GPU-based systems to date. "We're entering the era of Speech 2.0," he says. "And this is just the beginning."

Ng thinks speech recognition will become even more critical as less literate people use the Internet and prefer to speak instead of write queries. "Letting them talk to us is key," he says. He cited one example of a recent search query in China: "Hi, Baidu, how are you? I had noodles at lunch yesterday at the corner. Will they be on sale tomorrow?" Ng admits that's a very tough query to provide an answer to today, but he thinks better speech recognition will be the key.

It also will be important as the Internet of Things develops, bringing all manner of currently dumb devices online. He envisions a time when his grandchildren, assuming he eventually has them, marvel that we once had to deal with TV remotes and had microwave ovens that couldn't respond to voice commands. "Speech is an enabling technology for the Internet of Things," he says.

Ng declined to say exactly how long it would take for Baidu to incorporate the new speech recognition method into its search and other services. But asked if it could take years, he replied quickly, "Jesus Christ no!" So it seems likely to show up sometime in the new year. One "exploratory" project where the method might be applied is Baidu's Cool Box, a system that enables speech-activated music requests.

The work of Ng and his team, which now numbers about 30 and may double next year, will be instrumental in Baidu's attempt to elevate itself into the top ranks of Internet companies. Currently serving mostly the Chinese market, the company aims to expand internationally, which will involve developing world-class speech recognition, translation, and other features.
