NSA Mass Surveillance: Biggest Big Data Story

When people talk about the US National Security Agency/Central Security Service (NSA), the talk usually centers on privacy, with good reason. Still, it’s not the only subject worth discussing. The volume of data collected by the NSA and the associated costs make it the ultimate in Big Data case studies. What can it tell us about data and business? What can it tell us about business risk and the potential benefits and consequences of Big Data investments?

The agency’s exact budget is a government secret, but estimates put it around $10 billion per year. Although not all of that is devoted to surveillance, it’s reasonable to conclude that something in the ballpark of $5 billion goes to fund NSA data gathering each year. This may not be the clear-cut biggest Big Data application (Google’s revenue was $66 billion last year, for example), but it’s substantial, focused and paid for by the public. We ought to discuss what we’re getting for the money.

The budget is not the only cost of a Big Data program. Data gathering and analysis affect public perception and everyday business practices, and doing it badly can bring costs you never expected. The NSA’s programs have produced costs that the government and public may not have anticipated: correcting functional problems, lost business for US companies, additional security spending by US individuals and businesses seeking to protect private data, and diminished influence from the damaged credibility of the US government and American businesses.

Spies have always depended on communication surveillance to obtain information. Stealing documents, listening in on conversations and cracking the codes of secret messages are basics of the profession. Electronics have been part of the mix for decades: the British used an elaborate electronic surveillance system to listen in on captured German officers during the 1940s. What’s new is the volume and breadth of information gathered.

Communication surveillance is a major part of the NSA’s mission (paired with protecting sensitive US communications). Years before Edward Snowden leaked details of the NSA’s mass surveillance of US citizens, Devin Coldewey of TechCrunch reported “NSA to store yottabytes of surveillance data in Utah megarepository,” though that estimate was quickly challenged and a later update walked it back to “not so much.” While Coldewey, writing in 2009, may have been a little off-base on the quantity, he was right on target about the purpose: to store data from extensive surveillance programs. In 2012, James Bamford of Wired placed the cost of building that data repository at $2 billion and quoted an unnamed NSA official stating, “Everybody’s a target; everybody with communication is a target.”

Everybody’s a target. That’s the thing about Big Data. When you collect heaps and heaps of data, you may expect to end up knowing all about everybody, but in practice it often doesn’t work out that way.

When I was researching data sources for my book, Data Mining for Dummies, I gathered data on myself from several providers. These sources offer a lot of personal information. They can tell you, for example, that I’m single, a fan of gardening and aerobics, a pet owner, and a regular user of American Express and Discover cards. They can tell you my income and what month my insurance payment is due. What a lot of detail! Think of what you could do with information like that. But you won’t get the results you want, because every bit of that information is wrong.

When the NSA obtains communications data, it has advantages that you do not. It can get communication data directly from the source: genuine behavioral data, rather than the self-reported information and other secondary sources that are consistently of inferior quality. But huge volumes of data come with huge problems. The data management burden is stupendous, and most of that data is irrelevant to the intended purpose. Most people, and most communications, are not involved in government spying, terrorism or other crimes of interest to the NSA.

Because Big Data sources usually are not specific to any particular application, they are not necessarily the best resources for solving any particular problem. A small volume of data, carefully collected for relevance and quality, may offer more power.
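To see why most of that mass-collected data ends up as noise, consider a rough base-rate sketch. The numbers below are entirely hypothetical (population swept up, count of genuine targets, accuracy of a hypothetical screening model), but the arithmetic shows how rare targets get buried under false alarms even when the screening is very accurate:

```python
# Illustrative sketch with hypothetical numbers: the base-rate problem in mass collection.
population = 300_000_000      # people whose communications are swept up (assumed)
true_targets = 3_000          # actual persons of interest (assumed, a tiny fraction)
sensitivity = 0.99            # chance a real target is flagged (assumed)
false_positive_rate = 0.001   # chance an innocent person is flagged (assumed)

true_hits = true_targets * sensitivity
false_hits = (population - true_targets) * false_positive_rate
precision = true_hits / (true_hits + false_hits)

print(f"Flagged in total: {true_hits + false_hits:,.0f}")
print(f"Real targets among the flagged: {precision:.2%}")
# Under these assumptions, fewer than 1% of flagged records involve a real target;
# the other 99%+ is noise that analysts still have to store, manage and review.
```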

So what do the NSA’s Big Data programs provide us in return for our money?

Senator Dianne Feinstein, in a 2013 Wall Street Journal op-ed, said we’re getting a lot. “Working in combination, the call-records database and other NSA programs have aided efforts by U.S. intelligence agencies to disrupt terrorism in the U.S. approximately a dozen times in recent years, according to the NSA. This summer, the agency disclosed that 54 terrorist events have been interrupted -- including plots stopped and arrests made for support to terrorism. Thirteen events were in the U.S. homeland and nine involved U.S. persons or facilities overseas. Twenty-five were in Europe, five in Africa and 11 in Asia.”

But not everyone shares that view of the results. Some claim the agency is simply overwhelmed with data. When Senator Feinstein told us that terrorist events were stopped and arrests made in the US, I found myself wondering why she wasn’t talking about convictions. I wondered why all that data wasn’t adequate to prevent the 2013 Boston Marathon bombing.

Traditional research methods and resources sometimes produce better results than massive data sources. The successful hunt for Osama Bin Laden, a man with considerable resources and motivation not to be found, was conducted the old-fashioned way. Trained analysts, working with documents and other sources, thoughtfully researched the target over a long period. It was unglamorous work, and not highly appreciated during much of the time it went on. In early 2001, similar techniques provided warnings of a threat months before the attacks of September 11.

We talk a lot about the privacy implications of Big Data, as we should, but we don’t talk much about the costs and the quality of the results. As a statistician and data miner, I appreciate the value of data analysis, but I also appreciate its limits. When we invest in data, whether in government, business or any aspect of life, we ought to put serious thought and discussion into what value we’re getting for our money and effort.
