
Baidu's Surprising Search For The Holy Grail Of Artificial Intelligence


Can one set of formulas hold the key to making truly intelligent computers?

That’s the theory driving some of the leading lights of artificial intelligence, from Numenta cofounder and Palm founder Jeff Hawkins to Andrew Ng, the Stanford computer science professor hired earlier this year, to many observers’ surprise, as chief scientist of Chinese search giant Baidu.

Finding the one true algorithm of perceptual learning, in particular in the branch of AI called deep learning, is the key goal of Baidu’s new Silicon Valley AI lab in Sunnyvale, run by onetime Ng protégé Adam Coates. A native of the wine-country town of Calistoga, Coates learned programming in high school before becoming interested in machine learning at Stanford, where he bumped into Ng and started research with him on self-guided helicopters.

In an interview earlier this summer for a story I wrote on Baidu's global ambitions, Coates dove deep into cool new applications he hopes will grow out of the quest, regardless of whether the lab actually comes up with a single set of formulas. Following is an edited version of our conversation shortly after he joined Baidu.

Q: What interested you about machine learning?

A: I learned to program in high school as a hobby. One of the things that excited me about it was that it’s this very universal skill. It doesn’t matter what you’re excited about, whether it’s language or graphics and art or speech and books. Programming is the skill you can take with you anywhere.

When I took the first courses in AI and machine learning, it dawned on me that this was the next step and another universal skill. It doesn’t matter whether you’re really excited about language or flying helicopters, you can take it and use it to solve any problem. So my career has wandered through lots of applications--robotics, control systems for helicopters, computer vision, and now these deep learning algorithms and neural networks that have all sorts of applications. These techniques are powerful enough to give you a toolkit to solve interesting problems.

Q: So the problems you can solve, or want to solve, can be almost anything in a sense.

A: That’s right. For a while, we worked on robots, and we wanted the robots to do really simple things: go and pick up coffee mugs and clear our table after our meeting. It turned out that even something that seemed that simple and narrow to us is really hard. In order to recognize objects in images, you’ve just got to have a lot of understanding about how the world works.

One of the tough realizations of AI is that even the simple things that don’t look like intelligence turn out to be really, really hard anyway. There’s an enormous amount of background knowledge that we use to make these decisions that as human beings we don’t realize we’re using. To just recognize that this [tabletop conference call device] is a phone, for example, when there’s nothing about it that especially looks like a phone, relies on a huge amount of contextual and background knowledge that we apply without any effort. We need to solve the more generic problem of how to learn about the world first, so that when someone asks me about a telephone, I can answer very quickly.

Q: It’s kind of hard to know where to start that way, isn’t it?

A: It’s a big challenge. Do you need to build a full strong-AI system like a human being that can answer questions like the telephone one? Or can you do something narrower that only solves part of the problem but gives you good performance? Does a narrower system need the rest of the brain to do what it does? How much do we really have to build into the system to get it to do what we want?

Q: Have you come to any conclusions on that with regard to your work here at Baidu?

A: In the history of AI, we’ve gone down so many paths that it would be premature to say we have any one right answer. But I think we’re seeing lots and lots of progress using otherwise very simple neural network models. They’re sort of comical interpretations or models of what the brain is actually doing--a very, very loose cartoon. They’re nowhere near as complex [as the brain] in terms of how they operate. It may well be that the things we come up with don’t match what the brain is doing. But if we can get machines to do a lot of the things the brain does in terms of tasks, that’s the first step.
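To make concrete just how loose that cartoon is, here is a minimal sketch in Python of the kind of simple neural network model Coates describes. Everything in it (layer sizes, random weights) is an arbitrary illustration, not anything from Baidu’s systems: a couple of matrix multiplies and one nonlinearity, versus the brain’s trillions of adaptive connections.

```python
import numpy as np

# A minimal two-layer neural network: the "very loose cartoon" of the
# brain that Coates describes. All sizes and weights are arbitrary
# placeholders for illustration.
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)  # simple nonlinearity

W1 = rng.standard_normal((784, 128)) * 0.01  # input -> hidden weights
W2 = rng.standard_normal((128, 10)) * 0.01   # hidden -> output weights

def forward(x):
    """One forward pass: two matrix multiplies and one nonlinearity."""
    h = relu(x @ W1)   # hidden activations
    return h @ W2      # output scores

x = rng.standard_normal(784)   # e.g., a flattened 28x28 image
print(forward(x).shape)        # (10,)
```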

The brain has a huge advantage over us in terms of hardware. You have trillions of little connections in there doing their own processing, and we do not have chips at this point that are flexible enough to prototype something like that. We sort of have to make do with the hardware we have available and what we can feasibly do research with. So right now we’re in a spot where we have a model that works really well, gives us good results, lets us explore lots of questions, and maps well to the kinds of things computers are good at.

Q: Are you looking at new kinds of computing of the sort that neuromorphic chips are aimed at?

A: We’re always happy to have more computing power. But our ability to try out new ideas and new algorithms relies very heavily on the flexibility of programming languages. Computer programming is such a universal skill because it’s an abstraction that hides all these other complexities from us so we can just try out an idea.

But the challenge is how you bring that hardware to researchers in a way that lets them quickly try new things. We don’t know what the right answer is. I couldn’t tell you what algorithm to put in hardware to make the next step in AI work.

Q: What kind of model for a research lab do you see working best here?

A: There will be multiple research labs with different purviews. There may be more labs out here eventually. Broadly, our view is to be future-oriented--to figure out the next big step in AI. One of our best bets, because that’s where the light is, is to continue looking for new algorithms around the deep learning space. But we’re definitely keeping our eyes open for other approaches.

It turns out that a lot of these AI results coming from industry and academia rely very much on infrastructure that we’re all building together. So one early component of our lab will be building this infrastructure. As we come up with new ideas or see a new area that we think is really promising, we give ourselves the ability to rapidly prototype and make progress that way.

Q: What do you mean by infrastructure?

A: One of the big things about deep learning in particular is the amount of computing power needed. While it’s fairly straightforward at this point to write a computer program for your desktop computer, the tools and the set of things you need to know to make things work well on many computers, to harness a lot more computing power, are a lot more complicated.
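As one hedged illustration of the idea behind many-machine training, here is a toy data-parallel sketch in Python: each simulated "worker" computes a gradient on its own shard of the data, and the gradients are averaged. On a real cluster the workers are separate machines, and the communication, synchronization, and failure handling this simple loop hides are exactly the complexity Coates is pointing at.

```python
import numpy as np

# Toy data-parallel training of a linear model: each "worker" computes a
# gradient on its shard, and the gradients are averaged. All data here is
# synthetic; real multi-machine systems add the hard parts (networking,
# synchronization, fault tolerance).
rng = np.random.default_rng(1)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.standard_normal((1000, 5))                 # fake dataset
y = X @ true_w + 0.1 * rng.standard_normal(1000)   # noisy targets

w = np.zeros(5)                                    # model parameters
num_workers, lr = 4, 0.1
shards_X = np.array_split(X, num_workers)
shards_y = np.array_split(y, num_workers)

for step in range(100):
    grads = []
    for Xs, ys in zip(shards_X, shards_y):         # one "worker" per shard
        err = Xs @ w - ys
        grads.append(Xs.T @ err / len(ys))         # squared-error gradient
    w -= lr * np.mean(grads, axis=0)               # average, take one step

print(np.round(w, 2))  # approaches [1.0, -2.0, 0.5, 0.0, 3.0]
```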

One example is GPUs [graphics processing units, or chips used in graphics-intensive applications such as gaming], which are very popular right now. But if you want to squeeze the most performance out of your GPU and really use that to get to the next level in terms of the sizes of models you can handle, you’ve got to learn a lot about GPUs and what they can do and what they can’t do and how to program them. So one of the things we have to think about is how to build software on top of those things so that researchers can very quickly exploit that computing power to go try out new ideas.
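To illustrate the kind of software layer he means, here is a small sketch using PyTorch, a framework that postdates this interview and stands in here for whatever Baidu built internally: the researcher writes ordinary array code, and the library decides whether it runs on the CPU or as GPU kernels.

```python
import torch

# The kind of abstraction Coates describes: the same code runs on CPU or
# GPU, with the library hiding kernel launches and memory management.
# (PyTorch postdates this interview; it is an illustrative stand-in.)
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(256, 784, device=device)   # a batch of flattened images
W = torch.randn(784, 128, device=device)   # a weight matrix

h = torch.relu(x @ W)   # runs as a GPU kernel if a GPU is present
print(h.device, h.shape)
```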

Q: GPUs have been around for awhile, even for this purpose, haven’t they?

A: There is a lot of existing software for this stuff. Definitely, if you want to use one GPU or even a small number of GPUs to do your work, there are software systems now that can help with that. Our research at Stanford has worked on this, and many other groups are working on it, so a lot of infrastructure already exists. The question, as we learn more about deep learning, is how to make bigger models. Perhaps there are models that are not necessarily bigger but more expensive to train because they’re making more extensive computations.

Q: What’s the state of the art today?

A: The state of the art is really the system from my and Andrew’s group at Stanford--64 GPUs. That’s a pretty typical high-performance system. Once you pass hundreds of GPUs, you’re getting into supercomputer territory, where the set of issues you need to deal with to use all that computing power at once becomes very challenging.

Q: Is that something you intend to work on?

A: It’s something we’ve thought about. But we’re pragmatic in terms of wanting to find the next step to help us get better performance on applications that we care about. If it turns out that having a 100-GPU supercomputer is the way to do it, then I think we can have a team and infrastructure to do that. But if it turns out that you can do it on some other resource, or fewer resources, we’re happy to do that too.

Q: When would you have this infrastructure in place?

A: We’re talking small numbers of months. We sort of know what the crucial pieces to build are for current state-of-the-art deep learning algorithms. What to do for the next big thing--the next-generation algorithms we might want to try--is still to be figured out; we’re codesigning this with what deep learning researchers are coming up with.

Q: Once you have that infrastructure underway, what’s next after that?

A: There are lots of different applications of deep learning that we’ll want to look at. One of the early things we’re going to look at is computer vision. As of today, a lot of the state-of-the-art systems for various computer vision tasks rely very heavily on annotated or tagged data. Somebody basically has to tell me the right answer for every image I see: I see an image, and you have to tell me there’s a car here. Unfortunately, it’s not clear this is enough to get human-level performance or to do all the things with images that humans are able to do.

To get to the next step, we have to find a way to make use of all this untagged data that we have out there in the world. The Internet has massive amounts of images, but unfortunately no one has been kind enough to show up and label all of the objects in them. As we get bigger models and set the bar even higher for the next set of results, it looks like one thing we may need is a way to make use of all this untagged data. There have been some cool results, including the Google Brain team’s result showing a system learning to recognize objects without any labels. We’d like to look in that direction and find what the next algorithm is, what the next few key ideas are to make that work much better.
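One classic way to learn from untagged data is an autoencoder, which is trained only to reconstruct its input, so no labels are needed; the narrow hidden layer is forced to learn compact features. The following Python sketch is purely illustrative--random stand-in "images," tiny layer sizes--and is not the Google Brain system, which was vastly larger.

```python
import torch
from torch import nn

# A tiny autoencoder: learning from unlabeled data by reconstructing it.
# The inputs here are random stand-ins for real unlabeled images.
model = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),   # encoder: image -> compact features
    nn.Linear(64, 784),              # decoder: features -> image
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.rand(512, 784)        # stand-in for untagged images

for step in range(200):
    recon = model(images)
    loss = ((recon - images) ** 2).mean()  # reconstruction error, no labels
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # reconstruction error falls as features improve
```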

Q: Any in particular that maybe we haven’t heard of before?

A: The research community has been circling a lot of the same ideas for quite some time. The other thing that is really challenging is that there are all these little parameters you have to figure out, all these tweaks to the system, all these modules that you have to add in to make sure that the whole stack actually functions together in a really nice way. So I don’t think that it’s as simple as having the right algorithm. It could well be that a lot of the ideas we already have could yield big benefits but we just haven’t found the right mix yet.
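A sketch of that kind of tweaking in Python: random search over the little parameters--learning rate, layer size, regularization strength--that so often decide whether the whole stack works. Here `train_and_score` is a hypothetical placeholder (it just returns a random number) for actually training a network and measuring its accuracy.

```python
import random

def train_and_score(lr, hidden_units, weight_decay):
    # Placeholder: a real version would train a network with these
    # settings and return validation accuracy. This one ignores them.
    return random.random()

best = None
for trial in range(50):
    # Sample one configuration of the "little parameters."
    config = {
        "lr": 10 ** random.uniform(-4, -1),
        "hidden_units": random.choice([128, 256, 512, 1024]),
        "weight_decay": 10 ** random.uniform(-6, -3),
    }
    score = train_and_score(**config)
    if best is None or score > best[0]:
        best = (score, config)

print(best)  # the best-scoring mix found so far
```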

Q: So you’re aiming for the kind of incremental, though significant, gains that we’ve seen in recent years from deep learning?

A: Right. But in terms of the actual performance, it goes to show you how important the details are. Certainly I know that the speech recognition on my phone works far better than it did a few years back. While big ideas are welcome, sometimes even existing ideas, if you can find the right mix, can give you a big bump that you didn’t realize was there.

The lion’s share of that improvement is coming from scalability. Once we figured out how to build a small neural network that did what we wanted it to do, we kept getting faster and faster computers and bigger and bigger datasets. All these things met up in the right place at the right time with the right people, and we suddenly started getting much better results. So we have at least some hope that if we find the right mix again--if we do the same level of tweaking to find the right mix of algorithms--then there will be another big leap to be made with another step in scalability.

Q: Where do you think the biggest opportunities are for improvement?

A: Just perception applications in general. There’s this idea that maybe the brain is using one algorithm to process any sensory modality. While historically we thought of speech recognition and computer vision and touch as being very different things, our sense is that if you build a system that’s very powerful and very good at working with, say, computer vision, a lot of those insights turn out to carry over to something like speech, and vice versa.

So what we’re hunting for is not a better computer vision system per se but a better perception system. So we throw vision and speech at it in a video, for example. Hopefully it can take that apart, learn from it, and make predictions about it.

Q: How are you proceeding here on that front? You ultimately want to apply this to speech recognition or something specific, but for now are you working on a general perception algorithm?

A: Some of this is still to be determined. A lot of the things that we care about are image tagging and object detection--classic computer vision tasks that are very valuable to things like image search and standard Web-based products. There are a lot of deep learning experts already at Baidu who have really been trailblazers at getting deep learning systems into products.

The really exciting things, though, come from the next generation. The difference in what we want to do in the Silicon Valley lab is not just looking at what I can do today to make the neural net in the cloud, or a particular application, look a little bit better, but thinking about what will enable new applications, things that right now don’t exist.

Q: Like what?

A: If you take pictures of a room, for example, there are lots of interesting applications that could help you. Take a picture of shoes, say, and find me shoes that look like that. Or take pictures of people’s clothing: I like this particular piece, find me something like this on the Web.

But this is pretty challenging. How do I understand taste? You’re really getting to the stage where you have to take apart the image and understand all the different things that you’re seeing. If you have a much richer interpretation of what is going on in an image at a more semantic level, that lets you conduct the kinds of searches on behalf of the user that you couldn’t conduct before, at least not very accurately.
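One common recipe for that kind of search, sketched here in Python with random vectors standing in for the features a deep network would extract from product photos: embed every catalog image as a vector, embed the query photo the same way, and return the nearest neighbors by cosine similarity. This is an illustrative assumption, not a description of Baidu’s system.

```python
import numpy as np

# Similarity search over learned image features. The embeddings here are
# random stand-ins for vectors a deep network would produce from photos.
rng = np.random.default_rng(2)
embeddings = rng.standard_normal((10_000, 256))   # one row per catalog image
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

query = rng.standard_normal(256)                  # features of the query photo
query /= np.linalg.norm(query)

scores = embeddings @ query                       # cosine similarity
top5 = np.argsort(-scores)[:5]                    # most similar images
print(top5, scores[top5])
```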

Q: So it would be taking it much further than Google’s cat demonstration.

A: We want computers to act on our behalf in the world. We would like to show them a picture and say, would you please categorize this with all the other photos from my vacation? Or could you tell who the people are in this picture? Can you tell me what these people are doing? In order to do that, computers need to understand the world. Right now, they don’t understand the world the way we do because they don’t have the same kind of experience. While you and I don’t get to see a million examples of a cat, nicely labeled, we do get lots of unsupervised images as we wander around the world and see how things work.

The hope is that we can find algorithms that learn in the same way. If we can do that, it enables these sorts of applications where computers understand how the world works, or how the world is put together. That will let them make the decisions that you and I would make when we see the same things.

Q: We’re only at the start of that now, then.

A: Exactly what algorithm enables that is up in the air. So that’s one of the things the lab is being established to search for.
