
Open-Source Solves Big-Data Problems: Talking To 'Mr. Hadoop,' Doug Cutting

NetApp

As big data continues to push and stretch the limits of conventional technologies, Hadoop emerges as an innovative, transformative, cost-effective solution.

Hadoop is an open-source software framework for data-intensive distributed applications; it was created by Doug Cutting in 2006.

In these highlights from my interview with Doug, he shares his insights about Hadoop’s growing popularity and its value for tomorrow’s business.


What would you like to share with executives about Hadoop’s future? Why should our readers be investing in Hadoop now?

Businesses should be investing in Hadoop because it can help them solve the problems that they have today. In the long term, I think all the trends point to Hadoop becoming a mainstay of enterprise data computing, if not the mainstay.

It’s a general-purpose platform that will be able to handle most of the workloads that businesses are now doing with other systems, as well as new kinds of workloads that weren't possible before.

Research shows that 50% of IT organizations are either doing something with Hadoop or planning to in the next 12 to 18 months. Did you ever imagine Hadoop would get that big?

No, no, not at all. I was very lucky to happen across something that was becoming a big trend in computing. At the time, I thought, “There’s this wonderful technology at Google. I would love to be able to use it, but I can’t because I don’t work at Google.”

Then I thought, “There are probably a lot of other people who feel that same way, and open source is a great way to get technology to everyone.” I understood from the beginning that by using Apache-style open source, we could build something that would become the standard implementation of that technology.

When you created Hadoop in 2006, the term “big data” hadn’t even been coined. What was the problem you were trying to solve back then?

Most of the technology that we named Hadoop in 2006 was actually stuff that we’d been building since about 2003 in a project called Nutch. The problem I was trying to solve was a very specific problem of crawling the web, collecting all of these web pages and building indexes for them and maintaining them.

For the Nutch project we needed distributed computing technology: we needed to store datasets that were much bigger than we could store on one computer, and we needed processes that would run and be coordinated across multiple computers. We saw the papers from Google—the GFS paper and the MapReduce paper—and we thought, “That’s the right platform.” So we set about rebuilding that platform.
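The MapReduce paper Doug mentions describes a simple programming pattern: a map step emits key-value pairs, the framework shuffles them so all values for a key land together, and a reduce step folds each group into a result. The canonical illustration is counting words across documents. This is not Hadoop’s actual Java API, just a minimal single-machine Python sketch of the pattern (the document names and helper functions are illustrative):

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in a document.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reduce: sum the partial counts for a single word.
    return (word, sum(counts))

# Hypothetical corpus standing in for crawled web pages.
docs = {"d1": "the web the index", "d2": "crawl the web"}
pairs = [p for text in docs.values() for p in map_phase(text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
# e.g. result["the"] == 3 and result["web"] == 2
```

In Hadoop proper, the map and reduce tasks run on many machines, and the shuffle moves data between them over the network; the storage layer underneath plays the role of GFS.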

Do you think business and IT leaders understand the value and potential of Hadoop? Do they get it?

I think they’re beginning to get it, yes. There’s been a lot of good writing about this trend, and I think people recognize the trends that are leading us here.

With hardware becoming more and more affordable, and keeping in mind Moore’s Law, you can afford a huge amount of computation, a huge amount of storage, yet the conventional approaches don’t really let you exploit that.

On the other hand, more and more of our business is becoming online business. Businesses are generating vast quantities of data. If you want to have a picture of your business, you need to save that data; and you need to save it affordably and be able to analyze it in a wide variety of ways.

In reference to Geoffrey Moore’s “crossing the chasm” theory, do you think that Hadoop has crossed the chasm from being an early adopter project to an “early majority”?

I haven't seen a real chasm that we need to cross or a trough that we need to get through. There’s a lot of attention, a lot of hype, but I believe that the level of adoption is steadily increasing.

People’s expectations are reasonably well matched. They understand that it’s a new technology, and they’re cautious about moving to it because it generally involves a big investment in hardware, and you have to train people. So they start [small], exploring the technology.

But the next year they double the size of their cluster, or they start another cluster. ... After that, it actually becomes a stable part of their established platform. With Hadoop, I don’t see that overhang, where the adoption is over-anticipated. Maybe I’m blind to it because I’m right in the middle of it, but it seems like the expectation is that it will be a big part of computing.

On a personal level, what inspires you? What drives you to solve big challenges?

I like to think about technologies that will make a difference. I’ve always loved open source because it’s such a tremendous lever. What I look for is a way to find the smallest thing I can do, with the least amount of work that will have the most impact. Where is the leverage point?

Hadoop came out of that. We needed to do some vast computing, but I also saw a lot of other workloads that could benefit from this.

These are just the highlights from my interview with Doug. For the full text, read my IT Corner blog post.
