The SQL Empire Strikes Back To Claim Agility

Hadoop has introduced a lot of agility into the world of analytics and data integration. Concepts such as schema on read and the radical flexibility of processing data stored in files have allowed those with the right skills to move fast compared to what can be done in the world of SQL-based data warehouses.

But through a variety of technologies, some of which were born to serve Hadoop, the SQL empire is now striking back to claim agility. A new group of technologies has been developed to address some of the long-standing challenges of the data warehouse ecosystem and of SQL databases in particular. By putting these technologies to work in intelligent and appropriate ways, it is now possible to move just as fast in the world of SQL as you can anywhere else.

Ending the DBA Bottleneck

One of the long-standing complaints about the world of SQL, especially when constructs like data cubes are employed, is that it takes forever to change the database schema to introduce a new field. An analyst who needs to add new fields to an existing database, or perform a very specific search, often has to rely on a database administrator. Inevitably, this creates a bottleneck, slowing the process of discovery and frustrating the analyst. But as I’ve written before (see “Two Roads To Instant Big Data”), companies are overcoming these challenges and, in the process, radically transforming the SQL-based data warehouse into a responsive and flexible repository.

This magic trick is performed by modeling the structure of the data warehouse in abstract form and then using that model to generate all of the detailed configuration that a DBA would otherwise have to deal with. WhereScape and BIReady (the latter recently purchased by Attunity) are the two most advanced companies in this field.

By using this approach, a DBA can do in minutes what may previously have taken days or weeks. WhereScape and BIReady are making it much easier to set up new models, make system-wide schema changes, or map the flow of one part of the data warehouse to another. The API-driven experience of Amazon Web Services is re-created on top of the data warehouses of Teradata, Oracle, Microsoft, and IBM, among others. In addition, the risk of error goes way down. When a DBA faces the airplane-cockpit dashboard of configuration options in a data warehouse, they are rightly worried about making mistakes. With a model-driven approach, those details are taken care of for them, which makes change much less risky.
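To make the idea concrete, here is a minimal sketch of model-driven generation in the spirit of WhereScape and BIReady. The model format and generator below are invented for illustration; the real products maintain far richer models and also generate the ETL and change scripts for Teradata, Oracle, SQL Server, and other targets.

```python
# A hypothetical abstract model: entities with typed fields.
MODEL = {
    "customer": {
        "customer_id": "BIGINT",
        "name": "VARCHAR(200)",
        "region": "VARCHAR(50)",
        "created_at": "TIMESTAMP",
    },
    "order": {
        "order_id": "BIGINT",
        "customer_id": "BIGINT",
        "total": "DECIMAL(12,2)",
        "ordered_at": "TIMESTAMP",
    },
}

def generate_ddl(model):
    """Render each entity in the abstract model as a CREATE TABLE statement."""
    statements = []
    for entity, fields in model.items():
        cols = ",\n  ".join(f"{name} {sqltype}" for name, sqltype in fields.items())
        statements.append(f"CREATE TABLE dw_{entity} (\n  {cols}\n);")
    return statements

# Adding a field is a one-line change to the model; the DDL (and, in a real
# tool, the ETL and migration scripts) is regenerated rather than hand-edited.
MODEL["customer"]["segment"] = "VARCHAR(50)"

for ddl in generate_ddl(MODEL):
    print(ddl)
```

The point is not the toy code but the division of labor: the human edits the model, and the error-prone detail work is generated.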

With these types of tools, SQL-based data warehouses are no longer like your ninety-year-old uncle, yelling at everyone to keep off his lawn as he remains impervious to all the cultural changes of the last forty years. Instead, DBAs and analysts can make changes rapidly with the confidence that they are not breaking anything.

Data and ETL Modeling

The next step is to go from a data warehouse to a rendering of information that better models the business. Right now, in most companies, the definitions of key concepts reside in spreadsheet formulas. What is the detailed definition of a customer to the marketing department, including all the complex rules? Go ask the marketing analyst who developed his spreadsheet over the past year. Look at the formulas across several tabs in a nest of spreadsheets, and voilà: you’ve got your answer. Does anybody ever do this? No, which is a blow to transparency and confidence in data.

Both BIReady and Looker address this problem, in different ways. BIReady creates an abstract model of the concepts in a business and can then use that model to generate both the schema and the ETL code to populate it. In this way, you can have a central model that can then be rendered, in full or in part, to create numerous different data warehouses.

Looker uses a language called LookML to create its model of the business, which can then be explored through visualizations and pivot tables. Looker’s power comes from the fact that you can move all those complex definitions of business concepts from spreadsheets to the model. Looker then generates and executes SQL statements and caches the results so the analysts can easily traverse the model, drilling down and pivoting when needed, or returning to higher level summaries.
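The mechanics can be illustrated with a toy model. The dictionary format and generator below are invented for this sketch and are far simpler than LookML itself, but they show the core move: a business definition such as “revenue” lives in one central model, and SQL is generated from it on demand.

```python
# A hypothetical central model: dimensions and measures defined once.
REVENUE_MODEL = {
    "table": "orders",
    "dimensions": {
        "region": "region",
        "month": "DATE_TRUNC('month', ordered_at)",
    },
    "measures": {
        "revenue": "SUM(total)",
        "order_count": "COUNT(*)",
    },
}

def generate_sql(model, dimensions, measures):
    """Build an aggregate query from model definitions instead of ad hoc SQL."""
    dim_exprs = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    meas_exprs = [f"{model['measures'][m]} AS {m}" for m in measures]
    select = ",\n  ".join(dim_exprs + meas_exprs)
    group_by = ", ".join(str(i + 1) for i in range(len(dimensions)))
    return f"SELECT\n  {select}\nFROM {model['table']}\nGROUP BY {group_by};"

# Analysts pick dimensions and measures; the definition of "revenue" is
# maintained in one place instead of being re-derived in every spreadsheet.
print(generate_sql(REVENUE_MODEL, ["region", "month"], ["revenue"]))
```

Swap the dimensions and you have a pivot; drop one and you have a higher-level summary, all from the same definitions.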

Both of these technologies amplify agility by using a central model to automate the brittle, manual processes you find at most companies.

Data Warehouse Optimization

It’s unfortunate that optimization has become every bit as much a tech buzzword as disruption and innovation. In reality, most companies are still a long way from figuring out how to optimize their data across all their storage platforms, which leads to a juggling act that often ends in bloated budgets, unused or underused data, and crucial inefficiencies. But there are technologies available that allow for seamless optimization across the SQL data warehouse, Hadoop, and all other repositories.

A great example of this is Attunity’s Appfluent, which was just acquired in March 2015. Too often, companies have little transparency into how their data is being used, by whom, and what return they are getting on it. Appfluent shines a bright light into this abyss. It provides remarkable insight through usage profiling, delivering analytics on which data sits dormant or is rarely accessed. Armed with this intel, companies can then decide where that data should live; cold data is a prime candidate for replication onto Hadoop, which vastly lowers costs compared to letting it live in the data warehouse.
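A back-of-the-envelope sketch makes the triage concrete. The table names, sizes, and idle-time threshold below are all hypothetical; Appfluent itself derives usage from actual query activity across the warehouse.

```python
from datetime import date

# Hypothetical profiling output: (table, size in GB, last date a query touched it).
usage = [
    ("daily_sales",     1200, date(2015, 3, 1)),
    ("web_clicks_2011", 4500, date(2013, 6, 12)),
    ("customer_dim",      40, date(2015, 2, 27)),
]

COLD_AFTER_DAYS = 365  # illustrative threshold, not a product default
today = date(2015, 3, 15)

for table, size_gb, last_used in usage:
    idle_days = (today - last_used).days
    if idle_days > COLD_AFTER_DAYS:
        # Cold data: a prime candidate for offloading to cheaper Hadoop storage.
        print(f"{table}: {size_gb} GB, idle {idle_days} days -> move to Hadoop")
```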

Appfluent supports not just Hadoop but also Teradata, Netezza, IBM, and Oracle’s Exadata, among others. Vitally, it offers users an analytic dashboard integrating all these platforms so that companies can decide what the right workload and data are for each platform. But this optimization is done within the context of the business, which is what truly sets Appfluent apart. Too often, companies end up wasting money because they set up a data warehouse based on expected usage and never confirm that actual usage matches those assumptions. As a result, they fail to make the most of their data warehouse investment. But now, instead of guessing what belongs where, companies can make these decisions based on metrics.

Companies at any level of data sophistication can profit from such tools. Data profiling also reveals what can be moved to Hadoop for batch processing, so that data migration occurs for a reason. Appfluent can even map user activity to the business context, meaning companies can group activity, schemas, tables, and applications to determine utilization by business unit. With data becoming an ever-larger cost burden, these analytics allow companies not only to optimize their infrastructure, tailoring it to the needs and actual usage patterns of each business line, but also to see whether their teams are getting everything out of the data that they can. This type of SQL-based agility is as cutting edge as it gets.

Automated Assist for Data Preparation

A new class of technology has emerged that accelerates data prep both in the Hadoop and data warehouse ecosystems. Trifacta, Paxata, and Teradata’s Loom are the leading companies in this domain. Each of these firms takes a different approach to allowing an analyst to vastly speed up the job of massaging data into shape.

Trifacta, for example, uses a combination of data visualization, human-computer interaction, and machine learning techniques to make the data preparation process faster and more intuitive. Trifacta creates a recommended list of transformations selected by machine learning algorithms. The end user can then explore each recommendation and visually preview the effect of the suggested transformation before settling on a direction. Transformations can then be applied gradually in an interactive environment. This machine learning assist allows analysts to understand the nature of massive data sets that cannot be inspected any other way. Teradata’s Loom assists with transformations but also focuses on creating a catalog and capturing data lineage. Paxata uses semantic models to help improve the quality of data sets.
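A minimal sketch conveys the recommend-preview-apply loop. The heuristics below are crude stand-ins for the machine learning these products actually use, and the sample data is invented.

```python
import pandas as pd

# Invented sample data with the kinds of flaws a prep tool would flag.
df = pd.DataFrame({
    "name": ["  Alice", "Bob  ", "CAROL"],
    "signup": ["2015-01-03", "2015-02-14", "not a date"],
})

def recommend(series):
    """Profile a column and suggest cleanup transformations."""
    suggestions = []
    if series.dtype == object:
        if (series.str.strip() != series).any():
            suggestions.append("strip leading/trailing whitespace")
        parsed = pd.to_datetime(series, errors="coerce")
        if 0 < parsed.notna().mean() < 1:
            suggestions.append("parse as date (preview the rows that fail)")
    return suggestions

# The analyst previews each suggestion before applying it, rather than
# writing transformation code from scratch.
for col in df.columns:
    for suggestion in recommend(df[col]):
        print(f"{col}: {suggestion}")
```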

All three of these products, and the others in the space, are incredibly powerful tools for analysts, each seeking its own sweet spot between automation and analyst control.

While much of the focus of these companies is on serving the world of Hadoop, they are equally powerful for transforming data coming from SQL sources and for loading data into SQL targets. In this way, agile, machine-learning-assisted data preparation can accelerate SQL data warehouses.

SQL on Demand

Finally, it should be noted that we are living in a glorious period for SQL on demand. If you want a massive Teradata cluster, you can get a fully managed one in the Teradata Cloud in short order. Amazon’s Relational Database Service provides access to common SQL implementations such as MySQL and Microsoft SQL Server, among others. Amazon’s Redshift provides a SQL data warehouse for big data. Microsoft Azure also offers SQL in the cloud. NuoDB, VoltDB, and many others offer full SQL implementations on a software-as-a-service basis. Most of the tools mentioned work with these cloud implementations.
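From a client’s point of view these services are largely interchangeable, since each speaks a standard wire protocol. Here is a minimal sketch; the endpoint, credentials, and table are placeholders, not real resources.

```python
from sqlalchemy import create_engine, text

# Hypothetical Amazon RDS MySQL endpoint. Point the URL at Redshift,
# Azure SQL, or another on-demand service and the client code barely changes.
engine = create_engine(
    "mysql+pymysql://analyst:secret@mydb.abc123.us-east-1.rds.amazonaws.com/warehouse"
)

with engine.connect() as conn:
    result = conn.execute(
        text("SELECT region, SUM(total) AS revenue FROM orders GROUP BY region")
    )
    for region, revenue in result:
        print(region, revenue)
```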

My point is that many of the justified criticisms of the SQL world really no longer apply if you employ the technologies mentioned in this article. There is no reason that SQL technology need be a bottleneck. It can be as agile as ever if you take the right approach.

Dan Woods is on a mission to help people find the technology they need to succeed. Users of technology should visit CITO Research, a publication where early adopters find technology that matters. Vendors should visit Evolved Media for advice about how to find the right buyers. See list of Dan's clients on this page.