Information Governance Even More Important In The Era Of Big Data

This article is more than 10 years old.

Information is arguably the most important fuel businesses run on.  Intellectual property such as patents, institutional knowledge collected and stored by employees, sentiment gleaned from millions of social media posts, and consumer insights from the analysis of myriad online transactions are just a few examples of information assets companies leverage today.

In the Big Data era, where companies are able to quickly make sense of larger quantities of data than ever, information is finally recognized as a critical business asset.

But too many companies are on the “Big Data Bandwagon” without even a thought of how it could affect eDiscovery costs or risks.

For more than a decade, eDiscovery existed in relative obscurity.  While a select few recognized the skyrocketing costs and growing risks that digital information represents in eDiscovery, the topic did not achieve mainstream recognition until at least 2008.  And then, as now, eDiscovery – and information governance in general – did not get the respect it deserves.

Today, the all-too-pervasive attitude is that storage is cheap and that technology lets us churn through information more quickly.  Companies need to wake up to the reality that information governance is more important in the era of Big Data than it was before.

New Big Data tools leveraging technology such as Apache Hadoop can process and analyze high volumes of data at reasonable costs, creating business intelligence that companies can use for competitive advantage.

Sears recently invested in a Big Data program to improve customer loyalty and regain some of the ground the company has lost to Amazon.  According to Phil Shelley, Sears' executive VP and CTO, “With Hadoop we can keep everything, which is crucial because we don't want to archive or delete meaningful data.” [1]  The key phrase here is “meaningful data.”  How does a company know what data is meaningful?

Business Intelligence (BI) programs can make sense of structured data, giving companies a good – or even exact – sense of what data is meaningful.  But what percentage of a company’s information volume consists of structured data?  Most would be surprised to learn that the percentage is fairly small.

A large insurance company engaged me in a consulting project five years ago.  The company wanted to create a database archive for its operational data systems as it was filling up its storage capacity.  An analysis of the storage volume revealed that almost 90% of the information stored was from messaging systems (mostly email), document imaging systems (storing scanned contracts and claims), and report management systems.  The volume of structured data was relatively low, meaning that database archiving would not reduce the storage footprint by much.

The moral of this story is that hoarding information in order to leverage Big Data tools may work in the structured data world, but it will not work in the broader information world that includes unstructured content (the Word files, PowerPoint presentations, CAD files, audio files, and other large data types that dominate corporate information stores).

The risk inherent in keeping unnecessary information – such as paying to process and review it in the event of litigation – is rarely a consideration at all.  When eDJ Group and ViaLumina, Ltd. conducted an information governance (IG) survey recently, fewer than 30% of respondents indicated that Big Data Governance was in their plans for the next year.

In order to satisfy both masters – the businesspeople who need BI for better decision-making and those who recognize the need to control IG costs and risks – companies need to lose the hoarding mindset and embrace defensible deletion.  This is especially true with regard to unstructured content, most of which is duplicate or unnecessary information (think of all the junk and transitory email).

Defensible deletion sounds good in theory, but the question becomes: how do we know what to delete?  Current methods of information classification are inconsistent and do not scale well.

In addition, defensible deletion is not yet a mature practice.  In the IG survey, respondents indicated that most information is not being actively deleted.

The most deleted content?  Email – just over 50% of respondents are actively expiring email content.  The problem with the current state of deletion of this type of content is that most of it is time-based deletion, meaning that companies delete email after a certain amount of time.  That could lead to deleting valuable information.
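To make the weakness of time-based deletion concrete, here is a minimal sketch in Python.  Everything in it is an illustrative assumption rather than anything drawn from the survey: the `Email` record, the `still_needed` flag, and the 365-day retention period are hypothetical.  It simply shows that an age-only rule cannot tell a lunch invitation from a signed contract.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Email:
    subject: str
    received: date
    still_needed: bool = False  # hypothetical marker for business value


def expired_by_age(emails, today, retention_days):
    """Return the emails a purely time-based policy would delete."""
    cutoff = today - timedelta(days=retention_days)
    return [e for e in emails if e.received < cutoff]


def expired_with_value_check(emails, today, retention_days):
    """Apply the same age rule, but spare anything marked as still needed."""
    return [e for e in expired_by_age(emails, today, retention_days)
            if not e.still_needed]


# Hypothetical mailbox: two old messages, one recent one.
mailbox = [
    Email("Lunch on Friday?", date(2010, 3, 1)),
    Email("Signed supplier contract", date(2010, 3, 2), still_needed=True),
    Email("Quarterly forecast", date(2012, 11, 1)),
]
today = date(2012, 12, 1)

by_age = expired_by_age(mailbox, today, retention_days=365)
with_check = expired_with_value_check(mailbox, today, retention_days=365)

print([e.subject for e in by_age])      # the age rule sweeps up the contract too
print([e.subject for e in with_check])  # only the lunch invitation is deleted
```

The hard part, of course, is the `still_needed` flag itself: someone or something has to set it, which is exactly the classification problem described here.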

What is needed is a way to analyze information automatically (or at least semi-automatically, with some human review) to judge its business value.  While BI has gained mainstream traction in the structured data world, content analytics has not yet done so in the unstructured content world.  There are certain processes in which content analytics has proven useful – early case assessment and predictive coding, for example – but automatic classification is not yet one of them.  eDJ Group is currently conducting a survey to go deeper into attitudes on defensible deletion and will summarize the results in this blog.
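One way to picture the semi-automatic approach is as a triage step.  The Python sketch below is hypothetical: it assumes some content-analytics tool has already assigned each document a business-value score between 0 and 1, and the scores, thresholds, and file names are invented for illustration.  Clear cases are handled automatically, and everything in between goes to a person.

```python
def triage(score, keep_above=0.8, delete_below=0.2):
    """Route a document by its (assumed) business-value score."""
    if score >= keep_above:
        return "keep"
    if score <= delete_below:
        return "delete"
    return "human_review"  # uncertain cases are never deleted automatically


# Hypothetical scores, as if produced by a content-analytics tool.
scores = {
    "merger_term_sheet.docx": 0.95,
    "team_potluck_signup.xlsx": 0.05,
    "vendor_email_thread.msg": 0.50,
}

decisions = {name: triage(score) for name, score in scores.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Widening the gap between the two thresholds sends more documents to human review, which trades reviewer time for a lower risk of deleting something valuable.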

What companies must understand is that Big Data and the intelligence it can deliver are good and worthy of embracing.  But the governance aspect needs to be embraced in parallel, or eDiscovery nightmares will crop up down the road and bring with them huge costs and potential sanctions.

Effective information governance not only helps make business operations more efficient, but also mitigates risk.  Most organizations are so busy just trying to manage structured information that they haven't yet addressed unstructured content, much less given enough attention to the litigation risk associated with information.  Now is the time.


[1] Henschen, Doug.  Why Sears Is Going All-In On Hadoop.  InformationWeek.com.  October 31, 2012.