The Heartbeat Of A Data-Driven Culture: How To Create Commonly Understood Data

The whole world has fallen in love with the value that data can bring. The work that must be done to unlock that value is far less popular. As Thomas Edison said, “Opportunity is missed by most people because it is dressed in overalls and looks like work.”

In the case of data, the fun stuff is playing with technology: distilling massive data sets using cool stuff like Hadoop, blending them together with systems like RedPoint or Alteryx, knitting it all together in Teradata, and then perhaps making a nifty chart or data discovery environment with technologies like Looker, Tableau or Qlik. Well, actually there is a lot of work in there, but it is exciting to master a tool or discover something new.

The problem is that all across the world, such activities take place without first doing the work to determine the definition of the data going into the process. Where did this data come from? Is it the right data? Are we allowed to use this data? What other choices could we have made? How was it transformed? Are there any quality problems? And most important: Do we all understand this data in the same way?

All of these questions fall under the rubric of the snooze-inducing term data governance. If there was ever something that shows up dressed in overalls and looking like work, it is the activities needed to execute a program of data governance.

But let’s say that we answer all the questions listed above: How much more powerful does data become? How does it change the relationship between people and data in a business?

The Pitfalls of Ungoverned Data

In my view, using ungoverned data isn’t the worst case; using no data is worse. But data that doesn’t have a clear definition, as shown in Figure 1, leads to two situations, neither of which is good.

The first is the Data Rorschach Effect, which happens when people look at data and just see whatever they want to see. If everyone in the company is looking at their own dashboard, created to be just the way they want it, there is no way for a consistent big picture to be assembled. The definition of the customer in one report may be completely different than the definition of the customer in another. Same goes for revenue and expenses. Furthermore, everyone may have a different model of what drives growth. If data is not clearly defined, it can be a springboard for free association, often self-serving free association.

The second situation is the Data Brawl, which occurs when someone uses data to draw a conclusion that is damaging to someone else in the company. The CFO says that sales are dropping because she is only counting closed deals. The VP of sales says sales are rising because there are a record number of deals in contract negotiations. The first response by the injured party is usually to attack the quality of the data. This then leads to an acrimonious battle which good data governance would have avoided.
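To make the Data Brawl concrete, here is a minimal sketch in Python. The deal records, amounts, and function names are invented for illustration, and I am assuming that the VP's view amounts to counting deals in negotiation alongside closed ones. The point is only that the same data supports opposite conclusions when "sales" has two definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deal:
    quarter: str
    stage: str   # "closed" or "negotiation"
    amount: int  # thousands of dollars


# Hypothetical pipeline data, invented for this illustration.
DEALS = [
    Deal("Q1", "closed", 500),
    Deal("Q1", "negotiation", 100),
    Deal("Q2", "closed", 400),
    Deal("Q2", "negotiation", 350),
]


def sales_closed_only(deals, quarter):
    """The CFO's definition: only closed deals count as sales."""
    return sum(d.amount for d in deals
               if d.quarter == quarter and d.stage == "closed")


def sales_including_pipeline(deals, quarter):
    """The VP's definition (assumed): closed deals plus deals in negotiation."""
    return sum(d.amount for d in deals if d.quarter == quarter)


def trend(metric, deals):
    """Compare Q1 to Q2 under whichever definition of sales is passed in."""
    before, after = metric(deals, "Q1"), metric(deals, "Q2")
    if after > before:
        return "rising"
    if after < before:
        return "dropping"
    return "flat"


cfo_view = trend(sales_closed_only, DEALS)
vp_view = trend(sales_including_pipeline, DEALS)
print(f"CFO: sales are {cfo_view}; VP of sales: sales are {vp_view}")
```

Neither number is wrong. The disagreement is about the definition, which is exactly what an agreed-upon catalog is supposed to settle before the argument starts.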

But How to Implement Data Governance?

I feel about data governance just as Marianne Moore said of poetry, “I too dislike it.” But as Moore noted in her poem, “one discovers in it after all, a place for the genuine.” To me, data governance is all the work that must take place to make data into poetry, something that is distilled, compact, and radiating with meaning. I may dislike the work, but, as Moore does, I love the result.

There are many roads to data governance, but I recently came across a company, Collibra, that is taking an approach that has a strong chance of working in many companies. As I have in past stories, I am going to look at the essential idea of data governance by examining the approach that a vendor, in this case Collibra, is taking to solving the problem.

If you look up what management consultants and lots of other experts have written about data governance, you will get a lot of complex gobbledygook about ornate processes. Data governance can be complex, but to me the fundamental ideas are simple. The goals of data governance include:

Truth: Create an agreed-upon, commonly understood, searchable, integrated model, set of definitions, and catalog of the data that describes a business.
Communication: Document the model and definitions in all their forms so that people using data can conveniently know what a particular field or set of data means.
Change: Implement a process that allows the model and definitions to evolve and grow through a team-based process in which everyone plays the appropriate role.
Convenience: Integrate and automate related processes for data quality, granting access, updating database schemas, publishing metadata, and so on to achieve these goals.
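As a rough illustration of the first three goals, here is a minimal sketch of a catalog in Python. The class names, fields, and sample definitions are all hypothetical and do not come from any product. The sketch assumes a catalog is simply a set of named definitions that can be searched and revised while keeping their history.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    term: str
    text: str
    owner: str
    history: list = field(default_factory=list)  # earlier wordings of `text`

    def revise(self, new_text):
        """Change: keep the old wording so the evolution stays visible."""
        self.history.append(self.text)
        self.text = new_text


class Catalog:
    def __init__(self):
        self._entries = {}

    def add(self, definition):
        """Truth: one agreed-upon definition per term, case-insensitive."""
        key = definition.term.lower()
        if key in self._entries:
            raise ValueError(f"'{definition.term}' is already defined")
        self._entries[key] = definition

    def lookup(self, term):
        """Communication: anyone can ask what a term means."""
        return self._entries[term.lower()]

    def search(self, keyword):
        """Return the terms whose name or definition mentions the keyword."""
        kw = keyword.lower()
        return sorted(d.term for d in self._entries.values()
                      if kw in d.term.lower() or kw in d.text.lower())


catalog = Catalog()
catalog.add(Definition("Customer",
                       "A person who has completed at least one purchase.",
                       owner="analytics"))
catalog.add(Definition("Revenue",
                       "Value of closed deals, net of returns.",
                       owner="finance"))
catalog.lookup("customer").revise(
    "A person who has completed at least one purchase in the last 24 months.")
print(catalog.search("purchase"))
```

The fourth goal, convenience, is where real tooling earns its keep: wiring a structure like this into data quality checks, access requests, and schema updates is far more work than the dictionary at its core.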

I think the experience of Warby Parker, the innovative eyeglass retailer, shows this quite well. Lon Binder, the CTO of Warby Parker, and Carl Anderson, the director of analytics who recently published Creating a Data-Driven Organization, implemented a streamlined approach to data governance in the following way:

The data describing the company’s operations was landed in a SQL data warehouse.
The analytics team analyzed all the concepts used by the business staff and analysts at the company, which were mostly embedded in spreadsheets.
A new set of definitions for the various types of customers, revenue, and so forth was implemented using Looker, which allowed analysts to easily dive into the data, summarize it, and explore it using pivot tables.
The definitions were documented in a GitHub Git Book repository, which was used as an integrated catalog of all the data at Warby Parker.

The result was a simple and highly functional data governance process that works for a mid-sized but data-obsessed organization. More can be found in these stories that refer to Warby Parker (“Why You Can’t Be Data-driven Without A Data Catalog” and “Why Digital Paper Is Killing Efficiency and How To Stop It”).

How Collibra Delivers Data Governance at Scale

So now imagine that you have a team of hundreds or thousands of people using data scattered all over dozens of major applications, data warehouses, and analytics systems. You don’t have a data warehouse anymore; you have a data supply chain. Customer data may be in 10 different places. Even if you have a master data management system to collect all of the master data and make sense of it, you still have the challenge of managing the process of agreeing on the basic concepts.

In such an environment you will always have a heterogeneous set of applications and data analytics repositories. The idea of one repository to rule them all where the governance will happen is a fantasy.

In addition, you will have data that is owned by different parts of the organization and that will need agreements about how data is shared and the service levels that describe how data will be maintained.

This is the world that Collibra is focused on, the one shown in Figure 2. The software is constructed and used based on the following assumptions:

The integrated catalog is going to start small and grow and change at a rapid pace.
You need a process for managing the evolution and expansion of this catalog and communicating about the changes; tribal knowledge will no longer work.
The management of the integrated catalog must embrace a heterogeneous implementation and use of the catalog and definitions by many different types of tools.
Automation of this process will not be complete, but will gradually grow as mechanisms of integration mature.

Collibra CEO Felix Van de Maele, who co-founded the company based on research done in graduate school, realized that data governance is a process that is never truly complete. It must be supported as a moving vector, not as an end state. “Our goal is to just find somewhere to start at a customer, to find an important data set that needs to be trusted and understood in the same way across the business,” he says. “Once we show what we can do, the data-obsessed in the company arrive and want help in making their data a shared asset, and adoption just blossoms from there.”

Implementation of the Collibra vision shown in Figure 3 is based on the following principles:

The integrated catalog and definitions are independent of any implementation technology.
The catalog is searchable to allow analysts to find governed and approved data available to them.
During construction of the catalog, auto-discovery is used wherever possible to import implementation layer data and metadata.
The business-focused integrated catalog is created and connected to the logical and physical model through a process of design, construction of standards and policies, and formal approval, all controlled by workflows.
A ticket-based process is used for addressing problems and changes.
Integration is supported with a variety of implementation technologies to perform various data governance related tasks:
- Harvesting of physical layer and other intermediate data models from databases and programs like ERwin.
- Delivery of metadata describing lineage of fields to analysis technology like Tableau.
- Automatic creation of new fields in databases based on updates of the catalog.
- Certification of output of reports and analytics systems as based on accurate and approved data.
- Integration with data quality tools to report on the quality of the data and efforts to improve quality.
- Other integrations are added frequently as part of the Collibra Connect part of the product.
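To give a feel for what "automatic creation of new fields in databases based on updates of the catalog" could mean in practice, here is a minimal sketch using Python's built-in sqlite3 module. This is not Collibra's API or mechanism, which I have not seen at the code level. The catalog dictionary, table, and function name are hypothetical, and the sketch assumes the catalog is the trusted source of table and field names.

```python
import sqlite3

# Hypothetical catalog: table name -> {field name: SQL type}.
# Names are assumed to come from a trusted, approved catalog, because they
# are interpolated directly into SQL statements below.
catalog_fields = {
    "customers": {"id": "INTEGER", "name": "TEXT", "segment": "TEXT"},
}


def missing_field_statements(conn, catalog):
    """Build ALTER TABLE statements for catalog fields the database lacks."""
    statements = []
    for table, fields in catalog.items():
        # PRAGMA table_info returns one row per column; index 1 is its name.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        for name, sql_type in fields.items():
            if name not in existing:
                statements.append(
                    f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

# The catalog gained a "segment" field that the database does not have yet.
pending = missing_field_statements(conn, catalog_fields)
for statement in pending:
    conn.execute(statement)
print(pending)
```

The direction of the flow is the design choice worth noticing: the business-level catalog is edited first, and the physical schema follows it, rather than the other way around.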

When companies use Collibra to implement a data governance process and create a shareable, commonly understood data catalog, they take an incremental approach:

First, a high value report or dashboard becomes the focus.
Pre-existing data definitions are imported into Collibra.
The physical data models are imported into Collibra.
The team that is going to construct the catalog is defined and assigned roles inside Collibra.
The tasks to create the catalog are assigned to the team and proceed through a defined workflow.
Tasks include creation of concepts and definitions at a business level, creation of the logical and physical data models, documentation, integration with varying levels of technology for modeling and data quality, and approval.
The result is a business level catalog that is connected as much as it can be to all of the mechanisms for using data, monitoring it, and ensuring data quality.

This process is then repeated over and over again to gradually build out a comprehensive catalog of the crucial data in a company.
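The idea of tasks that "proceed through a defined workflow" ending in approval can be sketched as a small state machine. The statuses, transitions, and task below are my own assumptions for illustration, not Collibra's actual workflow model.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in review"
    APPROVED = "approved"


# Allowed transitions (assumed); a rejected review sends the task back to draft.
TRANSITIONS = {
    Status.DRAFT: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.APPROVED, Status.DRAFT},
    Status.APPROVED: set(),
}


class CatalogTask:
    def __init__(self, title, assignee):
        self.title = title
        self.assignee = assignee
        self.status = Status.DRAFT
        self.log = [Status.DRAFT]  # every status the task has held, in order

    def move_to(self, new_status):
        """Advance the task, refusing any step the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}")
        self.status = new_status
        self.log.append(new_status)


task = CatalogTask("Define 'active customer'", assignee="data steward")
task.move_to(Status.IN_REVIEW)
task.move_to(Status.APPROVED)
print(task.title, "->", task.status.value)
```

In this sketch an approved definition is final, so a later change would be opened as a new task. That keeps the approval trail intact and fits the ticket-based handling of problems and changes described earlier.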

In my view, Collibra’s vision has a variety of advantages:

The catalog in Collibra assumes a heterogeneous storage of data in many technologies. Many other technologies of this sort assume that one repository or technology will be used as the center, which limits the scope of data that can be governed.
Collibra puts as much focus on the process of creating, updating, communicating about, and using the data catalog as it does on the catalog itself. The to-do list for all the work of data governance is managed, not just the catalog.
Collibra doesn’t rely on pervasive automation. In other words, you can notice problems and assign people to fix them in other systems.
Collibra is useful without the Collibra Connect integrations with other technologies, but becomes more powerful as more integrations show up. The integrations provide something extra, such as reporting on data quality, adding metadata to analytics tools, or the ability to update a schema in a database by updating the catalog, but do not get in the way of the core value created.

The biggest worry I have about Collibra is how to motivate people to participate.

When the data catalog is complete, it becomes very useful to analysts who want to find high quality data, but there is energy required to build enough definitions so that it becomes a benefit to analysts.
Lots of the benefits of the commonly understood, integrated catalog are only clear after you have suffered from not having agreement on what the data in your company means. Do people have to suffer first to understand the value of Collibra and a commonly understood data catalog?
The work to create and maintain the catalog is substantial at first and not trivial as the catalog grows. How do companies recognize in KPIs and MBOs that this is important work for everyone, not just for the data governance team?

The crucial challenge facing companies today with respect to data is not how they will address the fun parts. The fancy toys will be bought because they are exciting. The companies that make the most of the data they spend so much to collect will be the ones that have enthusiasm for the overall-clad work of data governance, that is, turning raw and messy data into something genuine and meaningful like poetry. When that occurs, changes in data lead to changes in action, not to confusion.

Dan Woods is on a mission to help people find the technology they need to succeed. Users of technology should visit CITO Research, a publication where early adopters find technology that matters. Vendors should visit Evolved Media for advice about how to find the right buyers. See list of Dan's clients on this page.