
Solving the Mystery of Hadoop Reliability

Amid the excitement over all the new developments and substantial progress at the Hadoop Summit, there was a latent resignation. Right now, all over the world, long-running Hadoop jobs that have run for hours, or even days, are failing. Sometimes the problem is bad code, sometimes bad data. But often the problem is, well, the problem is… well, we just don’t know what the problem is. Start the F-ing thing over. Often that works, sometimes after a few tries.

Anyone who uses Hadoop on a regular basis knows that the reliability of complex jobs is a mystery. At most shops, the odds of a Hadoop job running according to plan every time are not 90 percent, or even 80 percent. Some Hadoop teams have turned this resignation into a sense of pride at their skill in managing the weaknesses of Hadoop. The question that interests me is: Why is this okay?

It is certainly okay with Amazon Web Services, which makes money on each execution, whether it finishes or not. It seems to be okay with the people running the jobs, because debugging them would be more costly than just re-running them. It is far less okay with the business staff who are eager to get results. As Hadoop’s aspirations grow and the workloads become less batch and more interactive, such reliability will not be okay with anyone. Companies like IBM, Microsoft, Teradata, and Intel, all of whom are buying heavily into Hadoop, don't allow this in their other products. Surely as Hadoop becomes more important to the enterprise, reliability cannot remain a mystery.

It is time then to ask: Why is Hadoop unreliable? How can Hadoop become more reliable? In this article, I will focus on the first of these questions with the help of Raymie Stata, CEO of Altiscale, and his engineering team. Stata, former CTO of Yahoo!, who was present at the creation of Hadoop, founded Altiscale to provide Hadoop as a Service. Altiscale is creating a purpose-built cloud that is designed to run Hadoop in a reliable fashion.

The idea is that you land your data in an HDFS file system in Altiscale’s cloud, and then make use of the Hadoop clusters as needed. The clusters grow or shrink depending on your workload. Stata says this approach will be more cost effective for Altiscale than using Amazon Web Services because his clusters will be able to become multi-tenant using the Docker technology that Google publicly embraced this week.

Altiscale runs Hadoop as a Service on its own purpose-built clusters. However, it does leverage some elements of AWS, such as Direct Connect, which provides high-bandwidth connectivity for customers who want to move their data to and from AWS.

(Qubole, another Hadoop-as-a-Service vendor, is taking the mirror image of this approach, landing data in S3 and using Amazon Web Services to run Hadoop. Right now, both Altiscale and Qubole seem to be thriving.)

Where to Look for Hadoop Reliability Problems

So, if you are fed up with just restarting jobs (Qubole actually has an auto restart feature to help automate your resignation), here is where Stata suggests you start looking:

- Setting your memory sizes incorrectly is a huge source of failures. “Hadoop is pretty picky about how much RAM needs to be allocated to different tasks,” Stata said. “If the setting is too low, tasks will fail because they need more memory. If the setting is too high, jobs will take longer because resources are being wasted.” (A minimal configuration sketch follows this list.)

- For jobs that have a lot of intermediate results between the map and the reduce phase, you get a problem called spill. Usually the mapper output fits in memory, but when there is too much of it, it starts to spill to disk. If your nodes have insufficient I/O to handle the additional overhead imposed by these spills, you can have problems. (See the spill-tuning sketch after this list.)

- When using Hive, jobs often fail because queries are structured in ways that create excessive intermediate results, which fill up storage and cause failures.

- If you are running Hadoop on AWS and use spot instances as a way to save money, your jobs can fail if those spot instances disappear at the wrong time. You need the right mix of reserved, on-demand, and spot instances to make it all work, a mix that’s hard to get right in the first place and that changes over time as your workloads evolve.

- If you have a large Elastic MapReduce job and you need 100 or 500 nodes, asking for them all at once is a recipe for failure. This is a lesson learned the hard way, Stata says. “You have to go through this dance where you get the first 10, and then add to it, and then add to it,” Stata said. “Each attempt to expand the size can fail, but since you are only asking for 10, it will fail relatively quickly. There is an art to building very large clusters.” (An illustrative growth loop follows this list.)

- Errors in configuration files are a common source of job failures. Config files need to point to the right bucket, have credentials in the right place, and be free of typos. These files tend to be copied and pasted across an organization, and eventually they become “black magic” that folks don’t understand, so when they break, a lot of time is lost.

- Default settings that come out of the box with Hadoop are often suboptimal. For example, the default setting for the so-called “slow start” of reducers kicks them off too early, which can leave Hadoop deadlocked. As this example indicates, these settings are often obscure, compounding the problem. (See the slow-start sketch after this list.)

- EMR has a mechanism called “bootstrap scripts” that lets a program initialize a cluster before a MapReduce job is run. These scripts are often fragile, causing jobs to fail. This is especially the case because teams tend to copy and paste them, so over time the organization builds up a hairball of a bootstrap script that is fragile but hard to change because of dependencies people don’t understand.

- Excessive run time can effectively mean a failure, for example when a job takes two or three times as long as it should because of data skew (imbalanced partitions), poorly behaving nodes, or map splits that are either too big or too small. Often these jobs are killed and restarted several times before the cause is found. (The split-size sketch after this list shows one of the relevant knobs.)

- Even when a job is working, over time changes occur that cause it to start failing: the partitioning of the data changes, the format of the data changes, or ranges of values change, and suddenly software that worked stops working.
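
To make the memory point concrete, here is a minimal sketch of how those limits are set through Hadoop’s Java API. The property names are standard Hadoop 2.x settings, but the sizes are purely illustrative and have to be tuned to your own cluster and workload.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemorySettingsSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // YARN container sizes for map and reduce tasks (MB). Too low and the
        // container is killed for exceeding its limit; too high and slots are wasted.
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.reduce.memory.mb", "4096");
        // The JVM heap should sit comfortably inside the container (roughly 80%
        // is a common rule of thumb), otherwise tasks die with out-of-memory errors.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        return Job.getInstance(conf, "memory-tuned-job");
    }
}
```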
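
For the spill problem, the map-side sort buffer is the usual knob. The sketch below assumes the standard Hadoop 2.x property names; again, the values are illustrative, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class SpillTuningSketch {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Size (MB) of the in-memory buffer that holds map output before it is
        // sorted; if intermediate data outgrows it, records spill to local disk.
        conf.set("mapreduce.task.io.sort.mb", "512");
        // Fraction of that buffer that can fill before a background spill begins.
        conf.set("mapreduce.map.sort.spill.percent", "0.85");
        return conf;
    }
}
```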
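
The “grow the cluster in small batches” advice is really an algorithm, and a rough sketch of it looks like the loop below. The Provisioner interface, requestNodes(), and waitUntilReady() are hypothetical stand-ins for whatever provisioning API your environment actually exposes; nothing here is a real AWS or Hadoop call.

```java
public class IncrementalClusterGrowth {
    private static final int BATCH_SIZE = 10;
    private static final int MAX_RETRIES_PER_BATCH = 3;

    /** Hypothetical provisioning interface, used only for this sketch. */
    public interface Provisioner {
        boolean requestNodes(int count);
        void waitUntilReady();
    }

    public static void growTo(int targetNodes, Provisioner provisioner) {
        int current = 0;
        while (current < targetNodes) {
            int batch = Math.min(BATCH_SIZE, targetNodes - current);
            int attempts = 0;
            // Ask for a small batch; if the request fails, it fails fast and
            // cheaply, instead of sinking a 100- or 500-node request all at once.
            while (!provisioner.requestNodes(batch)) {
                if (++attempts >= MAX_RETRIES_PER_BATCH) {
                    throw new IllegalStateException(
                        "Could not expand past " + current + " nodes");
                }
            }
            provisioner.waitUntilReady();
            current += batch;
        }
    }
}
```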
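
The slow-start default Stata mentions maps to a single Hadoop property. Here is a minimal sketch, assuming the standard Hadoop 2.x property name; the value 0.9 is one common choice, not a universal recommendation.

```java
import org.apache.hadoop.conf.Configuration;

public class SlowStartSketch {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Fraction of map tasks that must finish before reducers are scheduled.
        // The stock default (0.05) launches reducers very early; on a busy
        // cluster they can sit on containers waiting for map output. Raising
        // the value toward 1.0 keeps those resources free for the maps.
        conf.set("mapreduce.job.reduce.slowstart.completedmaps", "0.9");
        return conf;
    }
}
```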
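
Finally, for jobs dragged out by map splits that are too big or too small, Hadoop’s FileInputFormat lets you bound the split sizes. The bounds below are illustrative only; the right numbers depend on your data, block size, and cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-tuned-job");
        // Bound input split sizes so map tasks are neither so small that
        // startup overhead dominates nor so large that a few stragglers drag
        // the whole job out to two or three times its expected runtime.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);  // 512 MB
        return job;
    }
}
```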

While it will always be possible to write a Hadoop program ineptly, the goal of Altiscale, Qubole, and other such vendors is to take as many of these problems off the table as possible. In the meantime, getting smart about why jobs fail is a better policy than just starting them over, and over, and over.

Follow Dan Woods on Twitter.

Dan Woods is CTO and editor of CITO Research, a publication where early adopters find technology that matters. For more stories like this one visit www.CITOResearch.com.