
Java Creates New Big Data Opportunities

Oracle

By Roger Smith

The Java ecosystem is growing and evolving—thankfully. And that means better business applications for real-world scenarios, and more opportunities for Java programmers to get involved in everything from the Internet of Things to online fraud detection.

Over the last five years, the amount of data companies collect about you, me, and everyone we know has exploded, based on our online behaviors in the form of website visits, clicks, likes, tweets, photos, online transactions, and blog posts. That data is then sliced, diced, analyzed, and fed back to us in the form of digital advertising campaigns. The volume of data being collected is enormous. Every minute of every day, according to analytics firm Domo:

  • Google receives more than 4 million search requests;
  • Facebook users share nearly 4.2 million pieces of content;
  • Twitter users tweet nearly 300,000 times;
  • Instagram users post nearly 1.75 million new photos.

From 2013 to 2015, the global internet population grew 18.5% and now represents 3.2 billion people. That’s just people—imagine the data flow once things start getting online.

When billions of computers in the next generation of smart cars, TVs, and appliances—including those in my light switch, coffee maker, refrigerator, and blender—start talking to each other as part of the coming Internet of Things, the volume of data being mined for information will take another exponential leap.

Big Data and Java

Because all the data being generated by people and devices takes too much time and costs too much money to load into a traditional relational database for analysis, companies are adopting new analytics and storage approaches, described by the evolving term big data. Big data often involves storing data in a data lake: a storage repository, frequently in the cloud, that holds vast amounts of raw data in its native format until it is needed.

Big data analysis is often done using Hadoop, an open-source software framework written in Java. Hadoop allows data analysts to store large data sets across a large number of inexpensive servers and then run MapReduce operations on Java Virtual Machines (JVMs) in those servers to coordinate, combine, and process data. MapReduce takes a query over a data set, divides it, and runs it in parallel over multiple nodes. Distributing the computation solves the problem of data too large to fit on a single machine. By combining MapReduce with commodity Linux servers wired into massive computing arrays, analysts get access to low-cost supercomputing resources from almost any device.
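The split-map-reduce pattern described above can be sketched in a few lines of plain Java. This is not the Hadoop API itself, just a toy illustration of the idea: the input is divided into shards processed in parallel (here by threads standing in for cluster nodes), each shard is mapped to (word, count) pairs, and a reduce step merges the per-shard counts. The class and method names are hypothetical.

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce pattern (not the Hadoop API):
// the map phase turns each line into (word, 1) pairs, and the reduce
// phase sums the counts per word. parallelStream() stands in for
// distributing the work across cluster nodes.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        return lines.parallelStream()                                   // "distribute" the query
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+"))) // map phase
            .filter(w -> !w.isEmpty())
            .collect(Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum)); // reduce phase
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(
            List.of("big data big java", "java on the jvm"));
        System.out.println(counts.get("java")); // 2
    }
}
```

In real Hadoop, the map and reduce functions are separate classes, and the framework handles the shuffling of intermediate pairs between machines; the shape of the computation, though, is the same.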

During a presentation at the recent JavaOne 2015 conference in San Francisco, Dan McClary, Hadoop and big data product manager at Oracle, encouraged Java developers to leverage their programming skills on big data business development projects as a way to boost their careers as well as the fortunes of the companies they work for. McClary lauded new advances in Java, but also warned developers to watch out for a few technical "gotchas" and not fall asleep at the switch. "If you don't pay attention to how and where you write blocks of data, for example, you can end up with data concentrated on particular nodes instead of a nice, even distribution," McClary said.

Of course, the Java landscape is continuing to evolve. Over the last 18 months, McClary said, Apache Spark has become very popular in the big data space. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark (which also runs on the JVM) offers multistage, in-memory operations that provide performance up to 100 times faster than Hadoop and MapReduce for certain applications. "What's important from a programming standpoint," McClary said, "is it's two to five times less code because you have richer abstractions and functions that can be programmed in Java, Scala, or Python."
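The multistage, in-memory style that McClary describes can be sketched without Spark itself. The plain-Java example below (the class and method names are hypothetical, and Java streams stand in for Spark's richer abstractions) chains several transformation stages where each one feeds the next in memory, rather than writing intermediate results to disk between stages as classic MapReduce does.

```java
import java.util.*;
import java.util.stream.*;

// Not Spark -- a plain-Java sketch of the multistage, in-memory
// processing style Spark popularized. Each stage hands its results
// directly to the next in memory; no intermediate files are written.
public class InMemoryPipeline {
    public static List<String> topWords(List<String> lines, int n) {
        return lines.stream()
            .flatMap(l -> Arrays.stream(l.toLowerCase().split("\\W+")))    // stage 1: tokenize
            .filter(w -> w.length() > 2)                                   // stage 2: filter short words
            .collect(Collectors.groupingBy(w -> w, Collectors.counting())) // stage 3: count
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()) // stage 4: rank
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topWords(
            List.of("spark runs spark jobs fast", "spark caches data in memory"), 1));
    }
}
```

Expressing the same four-stage pipeline as chained Hadoop MapReduce jobs would require separate mapper and reducer classes per stage plus job-wiring code, which is the "two to five times less code" contrast McClary draws.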

On the horizon, McClary mentioned the DeepLearning4J project for powerful machine learning, as well as the Apache Zeppelin project, which is designed to provide a simple environment for data science. "If anybody has an interest in machine learning, DeepLearning4J is absolutely worth checking out," he concluded.

Roger Smith is a freelance technology writer.