Why Big Data Needs Natural Language Generation to Work

While we don’t often think of language as a user interface, that’s exactly what it is. We have thoughts, construct language to convey them, and send them out using our voices, keyboards, touchscreens, or our hands if we are using sign language. The goal of course is that the recipient understands our thoughts just the way we meant them.

Natural Language Generation (NLG) seeks to put a machine in the first part of this process. With Apple Siri, Google Now, and Microsoft Cortana, we can talk to our phones, which understand the questions we ask and often give us a useful answer in language that these systems generate. But think about the frame of this user experience. It is all about getting a quick answer to a simple question. Siri wouldn’t have much to say if you said: “I want to know everything our company knows about customer x,” or “Tell me what recent actions might indicate a propensity to buy a new product,” or “What’s the most effective strategy for selling a new product to this customer?”

For many applications, natural language can be preferable to the engaging visual interfaces we often encounter. As attractive as visually rich dashboards can be, when it comes to information density, they are usually far inferior to language. In a paragraph and a few bullet points, we can quickly tell a rich and complex story. In a variety of contexts, NLG technology from companies like Narrative Science, Automated Insights, Yseop, and, of course, IBM’s Watson is proving this point.

But the bigger game of NLG is not about the language but about handling the growing number of insights that are being produced by big data through automated forms of analysis. If your idea of big data is that you have a data scientist doing some sort of analysis and then presenting it through a dashboard, you are thinking far too small. The fact of the matter is that big data really can’t be understood without machine learning and advanced statistical algorithms. While it takes skill and expertise to apply these methods, once you have them running, they continue to pump out the insights.

Now comes the problem. What happens to the thousands of insights that are being generated automatically by all of those nifty machine learning algorithms? How do they find their way to a person at the right time? At their best, NLG systems offer an answer to this problem. They act as a router that understands the importance of the insights and delivers them to people who will find them interesting and relevant to their jobs.

The great news for early adopters and innovators is that this is all now becoming incredibly affordable. Narrative Science seems to be furthest along in creating a platform to support the two key activities that make NLG applications sing: Understanding the signals in a domain and rendering useful language for specific audiences. To understand the secret of how to make NLG applications work for you, you need to understand this one-two punch.

Rendering Language

Creating sentences with software has been a challenge and a fascination since the earliest days of computing. A program called ELIZA, created in 1966, attempted to carry on a conversation. While there was no hope of ELIZA passing the Turing test, it did have moments when it seemed more than just software. ELIZA spawned a category of software called chatterbots.

Kris Hammond, Chief Scientist and co-founder of Narrative Science, said that many, many Artificial Intelligence Ph.D. students created language rendering routines to show off the power of their research projects.

Of course, once you start doing this, it gets complex. The language rendering software needs to adjust the language if the results are singular or plural. The software must be able to add variations so the interface doesn’t seem too boring and algorithmic. For example, Siri will say, “The Patriots are favored by 7” the first time and “I’m hearing Dallas is favored by 1 point” the second time. There’s lots of blocking and tackling of this sort in language rendering software.
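A minimal sketch of that blocking and tackling, in Python, might look like the following. The phrasing variants and the crude subject-verb rule are assumptions for illustration, not any vendor’s actual implementation:

```python
import random

def verb_for(team: str) -> str:
    """Crude subject-verb agreement: plural team names like 'Patriots' take 'are'."""
    return "are" if team.endswith("s") else "is"

def render_line(team: str, points: int) -> str:
    """Render a point-spread sentence with number agreement and phrasing variety."""
    unit = "point" if points == 1 else "points"   # "1 point" vs. "7 points"
    verb = verb_for(team)
    # A few interchangeable templates keep the output from sounding robotic.
    variants = [
        f"{team} {verb} favored by {points} {unit}.",
        f"I'm hearing {team} {verb} favored by {points} {unit}.",
        f"The line has {team} as a {points}-{unit} favorite.",
    ]
    return random.choice(variants)

print(render_line("Patriots", 7))  # e.g. "Patriots are favored by 7 points."
print(render_line("Dallas", 1))    # e.g. "I'm hearing Dallas is favored by 1 point."
```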

But it turns out that knowing how to say something is only part of the answer. The bigger challenge is figuring out what to say: you have to work out the answer itself, not just how to express it. Here’s where the limits of programmatic approaches to language rendering appear.

Let’s say, using our previous example, that there were 20 different databases of customer information. The first step would be to look in those databases for all of the information for a particular customer. Okay, now you have that pulled out and can start figuring out how to summarize it.

The most common way to do this is to create a template that can describe each type of information. The template might be conditional. If one type of information is missing, then leave out that part of the template. Or only include parts A, B, and C of the customer information if all of them are there.
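As a rough sketch of what such a conditional template looks like in code, consider the following; the field names are hypothetical, chosen just for illustration:

```python
def summarize_customer(record: dict) -> str:
    """Build a summary from whichever fields are present; skip any missing sections."""
    parts = []
    if record.get("name") and record.get("segment"):
        parts.append(f"{record['name']} is a {record['segment']} customer.")
    if record.get("last_purchase"):
        parts.append(f"Their most recent purchase was {record['last_purchase']}.")
    # Only include the support section if all of its pieces (A, B, and C) are present.
    support_fields = ("open_tickets", "avg_resolution_days", "satisfaction")
    if all(record.get(k) is not None for k in support_fields):
        parts.append(
            f"They have {record['open_tickets']} open tickets, resolved in "
            f"{record['avg_resolution_days']} days on average, with a satisfaction "
            f"score of {record['satisfaction']}."
        )
    return " ".join(parts)

# With the support fields missing, that part of the template is simply left out.
print(summarize_customer({
    "name": "Acme Corp",
    "segment": "mid-market",
    "last_purchase": "a premium support plan",
}))
```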

Then you need to figure out how to suggest the right products for an upsell, or the right way to pitch them. The template was already doing a lot of work, and now you are asking it to do even more.

The problem with most NLG platforms is that they hard-code intelligence into a template. This makes for systems that are brittle, hard to change, and unable to accept new data without new coding. Don’t get me wrong. You can create some excellent applications with this approach, but you also create a lot of technical debt.

Understanding a Domain

Narrative Science is the only company I’ve come across that addresses the challenge of embedding intelligence in NLG without being constrained by the use of templates and the brittleness they create.

Instead of hard-coding the understanding of the signals from a domain in the template, Narrative Science has a separate semantic engine that makes sense of the data. In our example, the data, once extracted, would be sent to the semantic engine, which would first determine what is true and then determine which of those signals are important and impactful to various audiences.

What is true is determined through the application of techniques that would be familiar to any data scientist: time series and regression analysis, histogramming, ranking, etc. The semantic engine then decides what’s important based on an understanding of what’s normal for the whole population of the data. So if a client’s data in one area was just average, that’s not too important. But if they are unusual in some way or if one of their attributes just changed, then you have a key signal. Three or four important signals may be used to suggest a certain type of upsell.
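To make the “unusual versus average” idea concrete, here is a toy sketch that flags outlying attributes with a simple z-score against the population norm. It stands in for the richer statistical toolkit described above and is not Narrative Science’s actual engine; the attribute names and threshold are assumptions:

```python
from statistics import mean, stdev

def important_signals(customer: dict, population: list, threshold: float = 2.0) -> list:
    """Flag attributes where this customer deviates sharply from the population norm."""
    signals = []
    for attr, value in customer.items():
        values = [c[attr] for c in population if attr in c]
        if len(values) < 2:
            continue
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue
        z = (value - mu) / sigma
        if abs(z) >= threshold:  # average values aren't interesting; outliers are key signals
            signals.append(f"{attr} is unusual: {value} vs. a norm of about {mu:.0f}")
    return signals

population = [
    {"monthly_balance": 2000, "card_spend": 450},
    {"monthly_balance": 2500, "card_spend": 500},
    {"monthly_balance": 1800, "card_spend": 400},
    {"monthly_balance": 2200, "card_spend": 480},
]
# The balance is far outside the norm, so it surfaces as a signal; card spend is average.
print(important_signals({"monthly_balance": 9500, "card_spend": 470}, population))
```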

The second type of analysis the semantic engine does is to determine what is interesting or impactful to a particular audience. A retail representative at a bank may be interested in a whole different set of signals than someone who is originating mortgages.

The power of Narrative Science’s Quill platform is that it can pull the signal from the noise and then determine which of thousands of signals are important and which of those important signals should be sent to each specific audience. In addition, Quill has a memory of what was sent before, so it doesn’t become repetitive. It also offers full traceability. You can look back and see why each sentence was constructed the way it was. This is a huge boon when it comes to building trust in the system.
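As a toy illustration of that routing behavior (audience-specific interests, a memory of what was already sent, and a trace of why each message went out), one might sketch something like the following; the class and the audience names are assumptions, not the Quill API:

```python
from dataclasses import dataclass, field

@dataclass
class SignalRouter:
    """Send each signal only to the audiences that care about it, and never twice."""
    interests: dict                      # audience -> set of signal types they follow
    sent: set = field(default_factory=set)
    trace: list = field(default_factory=list)

    def route(self, signal_type: str, message: str) -> list:
        deliveries = []
        for audience, wanted in self.interests.items():
            key = (audience, signal_type, message)
            if signal_type in wanted and key not in self.sent:
                self.sent.add(key)                                    # memory: don't repeat
                self.trace.append((audience, signal_type, message))   # traceability: why it was sent
                deliveries.append(f"To {audience}: {message}")
        return deliveries

router = SignalRouter(interests={
    "retail rep": {"deposit_change", "card_spend_spike"},
    "mortgage originator": {"deposit_change", "new_home_search"},
})
print(router.route("deposit_change", "Balance jumped 40% this month."))   # both audiences
print(router.route("deposit_change", "Balance jumped 40% this month."))   # [] -- already sent
```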

Quill creates applications that act as routers of important signals, rendering them as needed for specific audiences. In this way, access to big data is not choked in a bottleneck based on how many dashboards each data scientist can create.

The fact of the matter is that to do the best job with NLG applications in the simplest and most maintainable way, you need a systematic approach to a semantic model. That’s the secret to having a big impact from big data.


Dan Woods is on a mission to help people find the technology they need to succeed. Users of technology should visit CITO Research, a publication where early adopters find technology that matters. Vendors should visit Evolved Media for advice about how to find the right buyers. See list of Dan's clients on this page.