Searching For The Smoking Gun With Text Analytics

This article is more than 8 years old.

Oleksandr Pryymak, a data scientist who works at Facebook, grapples with this challenge: how can you quantify surprise? Speaking early Sunday morning to an audience of computing experts at PyData London 2015, he explained concepts and mathematics that might be used for this purpose.

Pryymak’s math is rather esoteric, but it can be applied to common uses. The object is to enable computers to do something we all do every day: identify unusual things and events. He presents the example of detecting newsworthy stories in a social media feed.
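One common way to put a number on surprise, offered here only as an illustration and not as the method Pryymak presented, is "surprisal": the negative logarithm of how probable a word is, judged against a background of ordinary text. The sketch below assumes a simple word-count model with add-one smoothing; the function names and the tiny background sample are hypothetical.

```python
import math
from collections import Counter


def train_background(texts):
    """Count how often each word appears in a sample of ordinary text."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def surprisal(word, counts):
    """Return -log2 of the word's smoothed probability, in bits.

    Add-one smoothing reserves one extra slot for unseen words, so a word
    that never appeared in the background still gets a finite score.
    """
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 slot for any unseen word
    probability = (counts[word] + 1) / (total + vocab_size)
    return -math.log2(probability)


def average_surprisal(text, counts):
    """Average surprisal per word; higher means more unusual text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(surprisal(w, counts) for w in words) / len(words)


# Hypothetical background sample standing in for a feed's everyday posts.
background = train_background([
    "the cat sat on the mat",
    "the dog sat on the rug",
])
```

Under this toy model, a post made of familiar words scores low, while one full of words the background has never seen scores high. A real system would need a far larger background and a richer language model.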

The use of computerized mathematical and linguistic methods to analyze written language is known as “text analytics.” These techniques are most often used to summarize text or to identify common themes or attitudes expressed in words. Searching for surprise in text is a relatively new and active development in text analytics.

Similar techniques might be applied to other practical uses such as identifying security breaches, spotting the first signs of an epidemic, or tracking down significant documents in legal actions.

The ability to identify unusual text has significant value for legal applications. Decades ago, reviewing documents for litigation meant sifting through box loads of paper, searching for the “smoking gun,” that is, the documents that held evidence of wrongdoing. Today, these searches are equally common and important, but the documents are in electronic formats, and the volume of material can be tremendous. The expanding challenge of document review has created more and more work for attorneys, and greater and greater cost for litigants.

Landmark legal decisions, beginning with a decision and opinion by United States Magistrate Judge Andrew J. Peck in 2012, have opened the door to a new process that lightens the load of document review through automated analysis of text. Known as “predictive coding,” or “e-discovery,” this process uses specialized text analytics software to search documents and predict which are or are not likely to include material relevant to a particular legal action. Lawyers are not totally off the hook; manual review is still part of the process, but less time is required, reducing legal workload and costs.
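To make the idea concrete, here is a minimal sketch of how software might learn from a handful of documents a lawyer has already reviewed and then rank the rest for attention. It assumes a basic naive Bayes word model; commercial e-discovery products are not described in this article and may work quite differently. All names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """labeled_docs: (text, is_relevant) pairs already reviewed by a person."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in labeled_docs:
        word_counts[is_relevant].update(tokenize(text))
        doc_counts[is_relevant] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_log_odds(text, model):
    """Log-odds that a document is relevant; above zero leans relevant."""
    word_counts, doc_counts, vocab = model
    # Smoothed prior: how common relevant documents were in the reviewed set.
    score = math.log((doc_counts[True] + 1) / (doc_counts[False] + 1))
    totals = {label: sum(word_counts[label].values()) for label in (True, False)}
    vocab_size = len(vocab) + 1  # +1 slot for unseen words
    for word in tokenize(text):
        p_relevant = (word_counts[True][word] + 1) / (totals[True] + vocab_size)
        p_irrelevant = (word_counts[False][word] + 1) / (totals[False] + vocab_size)
        score += math.log(p_relevant / p_irrelevant)
    return score


def rank_for_review(unreviewed, model):
    """Put the documents most likely to be relevant first."""
    return sorted(unreviewed, key=lambda t: relevance_log_odds(t, model), reverse=True)


# Hypothetical reviewed sample and unreviewed pile.
model = train([
    ("delete the audit files before friday", True),
    ("shred the audit records", True),
    ("lunch menu for friday", False),
    ("team lunch schedule", False),
])
ranked = rank_for_review(
    ["lunch schedule update", "please delete audit records"], model
)
```

The ranking only sets the order of review. People still read the documents, which matches the point that manual review remains part of the process.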

The decision launched an industry. E-discovery products, events and professional groups have proliferated over the past three years.

While Judge Peck’s 2012 decision encouraged the use of predictive coding, details of how to go about it were left open to interpretation. The legal community and text analytics experts are still learning what is possible, from an analysis standpoint, and what will be acceptable to the courts. In a recent decision, Judge Peck went further, encouraging cooperation between parties through sharing of non-privileged documents and information about review methods.

Despite the unusual day and time, Pryymak’s talk attracted dozens of enthusiastic attendees. E-discovery has become so significant that whole conferences and professional groups are devoted to it. Looks like the mathematics of surprise are surprisingly popular.