Do You Suffer from the Data Not Invented Here Syndrome?


[Photo: the head of the android Data (Photo credit: Wikipedia)]

A realignment is rapidly taking place in the business world: the data created and maintained outside your company is becoming much more important than the data you can acquire from internal sources. Yet few companies realize this, and fewer are taking action. Instead, they suffer from the Data Not Invented Here Syndrome.

In the next year or two, I believe that this realization will become widespread as innovators show the business value of using the vast trove of external data. Right now, at most companies, most of the data used is homegrown. In the future, most of it will be externally created.

As usual, a few smart guys and gals have figured out parts of this story and are building products to make it easier to access and use external data. I’m going to present this analysis as a sort of detective story, show how the idea came together in my mind, and then suggest what can be done about it.

High Resolution Management: The Fundamental Impact

Enterprise applications and business intelligence systems provide us with a model of reality that allows us to track activity, to perform analysis that identifies patterns and important events, and to increase automation. But most of the systems we use were built in an era of information scarcity. They have a low-resolution view of reality, a one megapixel view, if you will.

“During the decade I spent working at Teradata with the Fortune 1000, the common theme was that about 15% of the enterprise’s data was actually being stored, and you would be lucky if you realized the maximum value of that 15%,” said Jim Kaskade, CEO of Infochimps. “What’s sad is that a decade later, this is still the case.”

Now, data from many external sources is becoming available to create a high-resolution view, and by that I mean not just a 1-gigapixel view, but a 100- or 1,000-gigapixel view.

For example, in a retail scenario, instead of just getting information about when a customer made a purchase, you will be able to get information about what led up to that purchase: what web pages they looked at, what other events are going on in their lives, what they said on social media streams like Facebook and Twitter or on a blog, and so on. Companies are just now beginning to understand the benefit of this high resolution view.
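To make the high resolution view concrete, here is a minimal sketch in Python that merges internal purchase records with hypothetical external events (page views, social mentions) into a single customer timeline. All field names and sources are illustrative assumptions, not any vendor’s schema.

```python
from datetime import datetime

# Internal, low-resolution view: just the purchase record.
purchases = [
    {"customer": "c-1001", "ts": datetime(2013, 2, 1, 14, 5),
     "event": "purchase", "sku": "A-42"},
]

# Hypothetical external signals that led up to the purchase
# (web analytics, social streams). The schemas are invented.
external_events = [
    {"customer": "c-1001", "ts": datetime(2013, 2, 1, 13, 40),
     "event": "page_view", "url": "/products/A-42"},
    {"customer": "c-1001", "ts": datetime(2013, 2, 1, 13, 55),
     "event": "tweet", "text": "Thinking about buying an A-42..."},
]

def timeline(customer_id):
    """Merge internal and external events into one high-resolution view."""
    events = [e for e in purchases + external_events
              if e["customer"] == customer_id]
    return sorted(events, key=lambda e: e["ts"])

for event in timeline("c-1001"):
    print(event["ts"], event["event"])
```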

Professor Elgar Fleisch of the University of St. Gallen has a name for what’s possible with far better models: high resolution management.

“You only can manage what you can measure,” said Fleisch. “With low cost sensors all over the place you can build a low cost MRI for managing physical processes just as Google measures every page view, click, and mouse movement to fine-tune its advertising business.”

But to make use of this high resolution view, we need advanced systems that can handle this level of detail. Machine learning applications like the ones from Opera Solutions (“Signal Hubs, Apps, and Products: Platform Design for the Next Generation of Machine-Learning–Based Applications”) are one way of gaining value from high resolution models. Using Splunk (“How Machine Data and Operational Intelligence Can Supercharge Business Applications”) and other systems to create rules that recognize important events is another.
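To make the second approach concrete, here is a minimal sketch of rule-based event recognition in Python. It is not Splunk’s or Opera Solutions’ actual machinery; the rule, fields, and records are invented to show the pattern of scanning a stream and flagging records that match a rule.

```python
# A minimal sketch of rule-based event recognition over a stream of
# dictionaries; real systems apply far richer search languages.

def high_value_abandonment(record):
    """Example rule: a cart worth over $500 abandoned for 30+ minutes."""
    return (record.get("event") == "cart_abandoned"
            and record.get("cart_value", 0) > 500
            and record.get("idle_minutes", 0) >= 30)

stream = [
    {"event": "page_view", "cart_value": 0},
    {"event": "cart_abandoned", "cart_value": 780, "idle_minutes": 45},
]

alerts = [r for r in stream if high_value_abandonment(r)]
print(f"{len(alerts)} important event(s) recognized")
```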

The high resolution model of your business won’t arrive fully formed from a vendor and you won’t be able to create it in one pass either. Rather, a high resolution view will emerge in a patchwork as more data becomes available and better models emerge. Seek the tools and data that enable the creation of high resolution models with narrow scope that focus on key process and customer areas, and then build a large model incrementally.

How Factual Adapted Bill Joy’s Wisdom to Data

Let’s say you agree with me that each business will have to gradually assemble a high resolution view. Now we must figure out how this will happen. I think a great example is provided by Factual, a company founded by Gil Elbaz, who, at his previous company Applied Semantics, contributed some of the early technology that powers Google’s AdSense business. At Factual, Elbaz and his team are pursuing the notion that to create a better world we need better data.

The simplest way to understand what Factual is doing is to recall Bill Joy’s comment about talent, "There are always more smart people outside your company than within it." Factual is applying that principle to data and recognizing that it can find parts of great data sets all over the place and then assemble them.

Factual is creating a curatorial engine that seeks out data created at many other companies and organizations. Using machine learning and other advanced technology, Factual has created an emerging data economy that encourages people to trade the data they have for improved data.

"Much of the data used in organizations is non-proprietary and is applicable across many different companies. By being strategic about outsourcing the compilation and maintenance of non-proprietary, common data, companies can free up resources to apply to deriving value from their data,” said Elbaz. “Additionally, as more companies outsource the same data, a common view of this data emerges around the world, decreasing integration costs for everyone."

The key point for this discussion is that Factual realized that great data sets exist in parts all over the world. While Factual is breaking new ground with advanced technology and data science to create high quality data sets out of parts, CIOs and CTOs don’t need to be so advanced. It is vital to start looking for relevant external data that exists in parts or as a complete whole. Then your high resolution model will come to life.

How Do We Acquire and Manage the External Data?

There will be many ways to acquire external data. For the next few years, it will be a free-for-all. Open data initiatives, data marketplaces, definitive data platforms like Factual, commercial data providers, and companies that have valuable data will all try anything and everything. The advantage will go to the companies that figure out what works first. It won’t take long for those with data that can help improve the resolution of models and recognize important events to understand the value of their asset. In other words, data acquisition will become a form of business development.

"With thousands of web and premium sources of data and an ever increasing number of open data APIs, it's critically important for organizations to know how to find high value external sources of data and bring this data into the mix with their own corporate sources of data to drive richer insights and an outside-in perspective," said Sharmila Mulligan, CEO of ClearStory Data, a Palo Alto startup.

While there is much more to be said about how the market for data will develop, there are three large changes that companies must come to terms with when using external data:

  • The data may have a different structure. It is not just rows and columns, but includes raw text, machine data, and more (see the sketch after this list).
  • The data must be managed and analyzed using different tools, such as Hadoop and Splunk.
  • The data will also be acquired in new ways.
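The first two points can be illustrated with a short sketch. Machine data often arrives as raw text rather than rows and columns, and must be parsed into structure before analysis, which is the kind of work tools like Splunk automate at scale. The log format and regular expression below are invented for illustration.

```python
import re

# A hypothetical web-server log line: raw text, not rows and columns.
raw = '203.0.113.9 - - [01/Feb/2013:13:55:36] "GET /products/A-42 HTTP/1.1" 200'

# Parse the raw text into a structured record before analysis.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)
match = pattern.match(raw)
if match:
    record = match.groupdict()
    print(record)  # {'ip': '203.0.113.9', 'ts': '01/Feb/2013:13:55:36', ...}
```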

One big question about external data is: Do you need to move it? Often, you can leave the data where it is and access it via APIs. Apigee, an API management and infrastructure platform, is focused on helping people use APIs both to gain access to external data and to provide access to their own.
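Here is a minimal sketch, in Python, of that pattern: querying an external provider’s API on demand instead of bulk-copying the data. The endpoint, parameters, and credential are hypothetical placeholders, not any real provider’s API, and the third-party requests library and a JSON-list response are assumed.

```python
import requests

# Hypothetical external data API; substitute a real provider's documented one.
API_URL = "https://api.example-data-provider.com/v1/places"

response = requests.get(
    API_URL,
    params={"category": "restaurants", "city": "Palo Alto"},
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder credential
    timeout=10,
)
response.raise_for_status()

# Assumes the API returns a JSON list of records.
places = response.json()
print(f"Fetched {len(places)} records without moving the source data")
```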

“In the app economy you don’t control all the relevant data anymore. If you don’t have access to what you need you could be losing out on valuable business insights. If you can’t provide your partners with the data that they need, your partners will suffer,” said Anant Jhingran, Vice President, Data at Apigee.

For example, if a company is using your API, it makes sense to find out what its customers are doing outside the use of your API. By getting access to that information, which may itself be provided via an API, you gain a complete picture. APIs can be used to access data of any sort that lives outside the four walls of your business. “In the app economy, combining these ‘weak signals’ with your own ‘strong signals’ inside the enterprise can be a source of competitive advantage,” said Jhingran.

In addition, Apigee has realized that intelligence must be added to APIs so they can distribute analytics and machine learning, distilling data and recognizing events closer to where the data lives.

While Hadoop provides a repository for all of this data, it is also vital to have a nervous system that can reach out, often through APIs, to suck up data from anywhere, clean it up if need be, and deliver it into Hadoop or some other repository. Splunk is a technology that has been adapted from its roots in data center operations to play exactly that sort of role.
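As an illustration of that nervous system, here is a minimal sketch in Python: reach out to an external source through an API, clean each record, and deliver it to a repository. The source URL is hypothetical, a local JSON-lines file stands in for Hadoop, and a JSON-list response is assumed; this is the shape of the pipeline, not Splunk’s mechanism.

```python
import json
import urllib.request

SOURCE_URL = "https://api.example.com/v1/events"  # hypothetical external source
SINK_PATH = "ingested_events.jsonl"               # stand-in for an HDFS path

def fetch(url):
    """Reach out through an API; assumes the response is a JSON list."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def clean(record):
    """Drop empty fields and normalize keys before loading."""
    return {k.strip().lower(): v for k, v in record.items() if v not in (None, "")}

def ingest():
    """Fetch, clean, and deliver records into the repository."""
    with open(SINK_PATH, "a") as sink:
        for record in fetch(SOURCE_URL):
            sink.write(json.dumps(clean(record)) + "\n")

if __name__ == "__main__":
    ingest()
```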

"Splunk can read data from nearly any source, from web servers and custom applications to social media and sensors,” said Guido Schroeder, senior vice president of products, Splunk. “Customers quickly realize they can increase the value of their data by enhancing and correlating it with data from external sources, whether for security or web analytics."

External data will also arrive in packaged form. Companies like ClearStory Data and Alteryx, along with traditional data providers like Dun & Bradstreet, are packaging up external data and providing it for analysis. ClearStory Data, a startup that has not yet revealed its full capabilities, has declared that it will provide a platform and user model that make it easy and intuitive to find high value external data from numerous trusted sources and cross-pollinate it with private and corporate data for fast, richer analysis. Alteryx provides US Census data and other syndicated or packaged data. Traditional data providers offer subscriptions to data sets that are delivered in various forms.

“For strategic analytics, it is imperative for companies to combine internal data with external content, or other data like social media, in order to get the most value,” said George Mathew, Alteryx President and COO. “However, that is just the start of the process. More important is what companies do with the data. They need to make the data actionable and leverage spatial or predictive analytics to add real intelligence into the analytics process.”

How Do We Create Value From External Data?

The world of business intelligence, the mission to acquire and use data effectively for business purposes, has been overtaken by two trends, both of which are highly relevant to making use of external data: big data and data science.

The domain of big data refers to the volumes of data being created in all sorts of ways that require new capabilities for management and analysis. The domain of data science refers to the new challenge of using software, analysis techniques, and data never before available to find new insights in data. Both big data and data science are heavily involved with external data.

But there is a danger in these new developments. Just as data warehouses and the data cubes used for OLAP became a bottleneck, the world of big data and data science could become a bottleneck too. Big data, data science, and external data will never reach their full potential if they are practiced only by a few. The ability to use data - big, external, or whatever kind you have - must be democratized.

This goal has been pursued for many years in the domain of business intelligence. The established vendors have all attempted to democratize their offerings, and new entrants such as QlikView and Tableau have offered ways to get more capabilities into the hands of more people so they can solve their own problems. When this has worked, the flowering of innovation predicted by Eric von Hippel’s theory of user-driven innovation has indeed occurred.

Now, however, the landscape has become more complex. The data is bigger. It comes in many different forms and structures. In many cases, it arrives faster and must be analyzed right away, in real time or near real time. The datasets are so large that they require automated machine learning or other advanced techniques to make use of them.

"New big data platforms need to not only facilitate finding rich and relevant external data sources but should also help data managers and analysts converge this data with private data to arrive at new insights,” said Mulligan of ClearStory Data. “Further, when working with a wealth of external sources of data, next generation big data solutions will have to go beyond just analytics and will need to provide new user models that aid human insight."

So is a bottleneck inevitable? If so, does that mean that data science will be as frustrating as certain aspects of business intelligence?

It doesn’t make sense to build a data lake to store huge amounts of big data or to acquire lots of data from external sources and then not have a plan for that data to be used by as many people as possible. If history says that it’s difficult to analyze 15% of the enterprise’s data, what will happen when you can access 100% and then supplement that with another 100% outside of your own? The key to creating value from external data is to get as many people involved as possible using the sort of technologies mentioned in this article. Create a scalable data analysis platform that serves the end-to-end process of making data valuable.

“External data will only continue to grow exponentially. The point should not be to strike fear in the hearts of people, but to get people started on figuring out how to find out what data is going to be valuable,” said Mathew of Alteryx. “The goal is to find data that adds the most relevant context to combine with internal content to answer key business questions such as knowing what customers are saying to better market to them or what the community make-up is in order to sell new items by region to them. This hyper-local view lets you get to the specifics through a variety of data and allows companies to grow their customer base, increase customer retention and create new streams of revenue.”

This need to create a complete value chain that incorporates external data is widely understood. In addition to Alteryx and ClearStory Data, vendors like SiSense, Pervasive, Platfora, Pentaho, QlikView, and Tableau all offer different approaches to democratizing access to external data and big data. Infochimps is creating a big data cloud that is optimized to connect to corporate data centers, allowing companies to plug directly into huge quantities of data without having to move it. Of course, SAP, IBM, and Oracle are working on this problem as well. There will be plenty more to write about this issue later.

So Why Aren’t We Looking?

One of the largest challenges in overcoming the “Data Not Invented Here” attitude is having a basis for understanding what external data is valuable. This is not an easy task.

Consider the example of Apple CEO Tim Cook and Steve Jobs before him. Both men were proud of being confident enough in their vision to say no firmly and frequently to many good ideas that did not serve the mission of the company. How could they be so confident in saying no? My sense is that they had a clear view of where they wanted to go and could happily appreciate the value of ideas that were excellent but did not move Apple closer to its goals.

Most companies I talk to have no formal program for looking for valuable external data and putting it to use. But even if they did, how would it work without some idea of the kind of data that would help? In the absence of a vision for what kind of data you need, you may end up with a huge pile of data without an idea of what you can do with it.

I have a proposal for creating a vision that would allow those looking for data (or those building applications or supporting the business in other ways) to understand which data might be valuable and which data to say a firm no to.

My proposal is to play what I call the question game, a way of understanding what the business wants to know. By harvesting, analyzing, and prioritizing the questions from all of the lines of business, it is possible to understand what data would be relevant. In addition, the output of the question game can be studied by any number of people so that many people can join the search for data.
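To show what the output of the question game might look like in practice, here is a minimal sketch in Python. The questions, priorities, and candidate data sources are invented examples; the point is only that the harvest becomes a shared, sortable artifact that many people can study.

```python
# One possible shape for the output of the question game: each harvested
# question carries a priority and the candidate data (internal or external)
# that could answer it. All entries are illustrative.
questions = [
    {"question": "What do customers say about us before they churn?",
     "line_of_business": "marketing", "priority": 1,
     "candidate_data": ["social media streams", "support tickets"]},
    {"question": "Which regions are underserved by our stores?",
     "line_of_business": "sales", "priority": 2,
     "candidate_data": ["US Census data", "internal sales records"]},
]

# Prioritizing the harvest tells the whole company which external data
# to pursue and which to say a firm no to.
for q in sorted(questions, key=lambda q: q["priority"]):
    print(q["priority"], q["question"], "->", ", ".join(q["candidate_data"]))
```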

Get Started Already!

The point of my argument is that there is much to be gained from systematically attempting to overcome the Data Not Invented Here syndrome and using external data to help run a business.

Be skeptical. Don’t believe me. Beat up my analysis. But, while you do that, start looking for data.

Follow Dan Woods on Twitter

Dan Woods is CTO and editor of CITO Research, a publication that seeks to advance the craft of technology leadership. For more stories like this one visit www.CITOResearch.com. Dan has performed research and writing projects for Apigee, Alteryx, Factual, Pervasive, Splunk, SAP, and Oracle, and many others in the business intelligence and big data arena.