The People of the Petabyte

This article is more than 10 years old.

The earnest editorial staff here at Forbes urge us contributors not to get too clever with headlines, but since headline wit is one of the few genuine and unadulterated pleasures of blogging, I occasionally (okay, often) succumb to temptation and sacrifice SEO best practices for a bit of a private chuckle.

Besides, being at a major data-related conference like Strata, and listening to data geeks salivating over prospects of ruling the world by 2015 using large-scale automated analysis of Web content, I've realized that it is up to us bad punners, over-clever reference droppers and metaphor-mixers to prevent the data geeks from accidentally creating Skynet. Or at least delay the takeover. Giving up a few ounces of SEO juice and some traffic is a small price to pay. Call it the Skynet Resistance Tax.

Speaking of data geeks, that's the topic of this post. I've spent much of my first couple of days at Strata idly observing them (and wondering whether I am one of them), and speculating about the various species in the data scene (people-watching is how you stay awake between the interesting slides at conferences).  So here is my informal taxonomy and anthropological survey of data-land.

A Taxonomy for Data Land

The taxonomy part is simple. Apparently the list of species in data land is very short. It has only one item:

  • Data scientist

Okay, I am exaggerating a bit, but that's what it feels like, to hear the talks and hallway conversations. IT admins, six sigma types rushing to the data bandwagon, ex-BI types, visualization and infographic geeks, analytics geeks, programmers, old-school statisticians, Hadoop wranglers -- they all seem to be calling themselves data scientists now. There are more complicated taxonomies floating around, but everybody appears wary of accepting them.

Which brings us to the informal anthropology of what's going on.

Everybody is a Data Scientist

I was gratified to learn in an early talk during the first day that I am apparently not a newbie to the field at all, as I had thought. I too am a data scientist, and a grizzled veteran at that. Apparently, I am among the vast numbers of people who've been doing data science all along without realizing it. All it takes is some rudimentary experience running statistical tests on a data set, mucking around trying to prove a couple of hypotheses, and generating a few visualizations. You don't need to know that Hadoop is a toy elephant to qualify for the title, apparently. Having used Google Analytics or Search Insights counts as having experience with Big Data.

I am not making that case. One of the Strata speakers who can genuinely lay claim to the title made the case.

Extremely generous and inclusive, you say? Therein lies a story.

If you've been in the technology world for a while and have surfed a couple of hype cycles, you are probably familiar with all the types of identity angst you encounter around any new technology trend. At any given technology conference, you will find the following types:

  • People with chips on their shoulders about being marginalized by the new trend.
  • Long-ignored people who suddenly find that they've turned into stars, blinking in the spotlight.
  • People who feel under-appreciated and powerless.
  • People who cannot believe how much power they suddenly have.
  • People who secretly feel like fakes and are feeling either gleeful or ashamed about it.
  • People who are cleverly switching out their titles from the last hyped fad for the closest one they can find in the new one.
  • People upset that other people are taking credit for their old wine by putting it into new bottles.
  • Older people insisting nothing has changed (read: "therefore I am still the expert").
  • Younger people insisting everything has changed (read: "the old fogeys know nothing; hire me instead").
  • People excited by anything new and shiny, whether or not they understand it.
  • Jaded people on paid-for junkets.
  • Uber-sociable types for whom it is all one big party.

All in all, every technology trend is a seething drama of identity angst.  And besides individuals, corporations have a stake in the labels-and-titles game as well. Big and established players like IBM, Microsoft and EMC must contend with an endless army of feisty startups trying to redefine the game in their own interests. Multi-million dollar deals can be won or lost based on whether you use the term "Big Data" or "Data Warehousing" in your sales pitch.

What makes the data scene interesting is that this soul-searching for identity appears to be happening with a peculiar urgency, and the data community has reached a certain bizarre consensus position that I've never seen before in technology: they've decided that everybody is a data scientist. I've never seen quite this level of title ambiguity before (the one field that might potentially have more title ambiguity is "user experience"). It's like the South Park episode about the alien race that uses the word "marklar" for everything.

So why is everybody a data scientist?

The Detente Around Big Data

I think what we are seeing in the data game is a sort of uneasy detente.  Here are just a few examples of title/label wars that I've heard mentioned.

  • Data mining vs. machine learning: One speaker mentioned that people who used to go by the title data miner are now offended by the term and prefer to be called machine learning experts. There are differences in substance and connotations, but for whatever mysterious reason, the stock of the latter term appears to be appreciating. Those who switched titles early enough would have benefited while those who were a little less alert have left themselves open to the charge of band-wagonism.
  • BI/DW vs. Big Data: On the industry side of the fence, many business intelligence and data warehousing veterans are smartly repackaging themselves as Big Data people. Again, there is an underlying tension. In this case, a generational tension between mid-career, middle-management types who want to find roles in the new game, and a younger set trying to differentiate itself by defining the new game in more exclusive ways. Is it the same old game or a new game? Certainly there are new technological elements that everybody acknowledges, but the significance of those elements depends on whether you are a veteran or a fresh young type.
  • Analysts vs. Analytics: People who pulled data, crunched it, and turned it into presentations used to be called analysts. Now those who wrangle real-time data streams and steward processing pipelines that feed live dashboards call themselves analytics experts. It is a similar skillset, but a different mindset. Again, there is a faultline with simmering tension.

The fact that people have converged on a strange truce -- calling everybody a data scientist -- is interesting. In other technology fields that I've tracked over the years, you generally have a new set of titles displacing an old set over a few years. This kind of extreme title convergence is rare.

My suspicion is that there is more than thin skin and an identity-politics truce here. There are four additional causes: money, integration pains, cultural mission and impact on the analyst trade.

Money

Big Data is Big Money. And the big adjective is crucial, because that's what justifies new infrastructure investments.

These are not the people of the byte, megabyte, gigabyte or even terabyte. They are people of the petabyte. The difference is the difference between an external hard-drive costing a few hundred bucks and a compute-cluster infrastructure that can employ a small army and use up a small town's worth of electricity. Undoubtedly a lot of companies will go bankrupt setting up excess capacity they don't need and/or cannot sell. Undoubtedly a few companies (and I mean outside of the tech sector itself) will rewrite business playbooks by competing on Big Data. Costs will fall and margins will shrink. But a lot of money will be made in the meantime. A lot more money than in many similar booms. This might be the revolution that finally displaces old Big Iron mainframe infrastructure for good.

One speaker quipped that the difference between an "analyst" and a "data scientist" is about $40,000. The bigger the market, the bigger the incentive to stop infighting and forge an opaque consensus. Infighting creates a kind of transparency that benefits buyers, by allowing them to play divide-and-conquer games.

A book I am currently reading tells the story of how vicious patent battles over the Bessemer steel-making process in America were resolved primarily because of the huge dangling carrot of massive railroad contracts; it was in everybody's interests to stop arguing and declare peace, so the money-making could begin. Something similar is happening with Big Data. Not since the undersea optic fiber cable boom has there been such potential for massive investment. Nobody wants to ruin the party. It makes sense to paper over internal squabbles and let everybody be a data scientist.

Integration Pains

Data is also a world begging for tight vertical integration of the supply chain from raw, unrefined wild data all the way to AI programs whispering insights into CEO ears. If you are a Republican, the holy grail vision is a dashboard-driven company that will allow a CEO to run it as if he/she were driving a car, with no additional human involvement above minimum-wage levels. If you are a Democrat, the competing vision is the similarly empowered citizen (at the moment, the Republican vision is winning). I expect a keynote at the next Strata titled "Big Data and Little People" (hashtag #OccupyData).

This is not as far off as it might seem. One speaker showed a picture of a very impressive looking CEO cockpit at Procter & Gamble, with two wall-sized screens and plumbing that allows you to zoom in from a world map to toothpaste sales in a single retail market with just a few clicks. And on the other end of the spectrum, the Average Joe can do stunning things simply using Google's free tools.

That's the sort of market opportunity that creates mega-tycoons after all: integrate, drive down costs, squeeze expertise out of human brains and into non-human systems.

Unfortunately, despite the clear all-around desire to automate and integrate, the data-wrangling processes are heavily dependent on human experts at the moment. Which means efforts at process integration in pursuit of lower costs cause human title boundaries to blur.

Will these pains go away? Only if sufficient automation is achieved. But here, despite all the analogies to natural resources ("data is the new oil"), there are serious issues. It is not yet clear that extreme automation and process integration can be achieved around big data in all domains of economic interest.

Cultural Mission

Big Data is unique among recent IT trends in that it is a market and opportunity created by an open source movement. The entire industry exists because of Hadoop, an infrastructure component inspired by Google technology. So there has been a sense of unity around a shared non-commercial mission from Day 1. The consensual label "data scientist" is partly a consequence of a sense that the data scene is a social fraternity rather than a business sector.

There is also a pragmatic consensus that the biggest gains will be found by mixing up large datasets owned by different parties (think "mashups for titans"). This is an element of the scene that is rather like the effort to standardize railroad gauges in the 19th century, or containerization in the 20th century. It reinforces the cultural mission by making standardization and data interoperability a matter of shared interest.

As a simple example, consider zip codes. As one speaker remarked, zip codes are simple, clean data that most of us choose to share fairly liberally. Yet we have to keep re-entering them on forms. There has to be a ton of value (and money to be made) in eliminating all that unnecessary typing.

Both these forces -- core non-commercial mission and pragmatism --  have made the data scene a business sector with a social mission at its heart.  "Data scientist" is almost a community self-identifier, like "brother" or "sister." Pretty soon, I am sure somebody will make a speech that starts, "brothers and sisters of the petabyte..."

Even a Microsoft speaker made soothing noises about "community." That tells you a lot.

The Analyst Trade

Normally, titles and labels stabilize when analyst firms like Gartner and Forrester drive towards a language of consensus. They are able to do that because they are generally disinterested third parties who make money by adding some clarity to muddy discourses.

But this time it is different. The thing about the world of data is that the analyst trade is among those most impacted. The business of analysts, after all, is to compile data, crunch it and turn it into slideware. Now uppity data startups are threatening the analyst game (the trade show floor has several technology companies showing off clever visualizations on large monitors; if the analyst trade isn't worried, it should be).

Which means things are going to remain murky for a while.

You Too Can Become a Data Scientist

So the bottom line is that there is big money looming. Fortunes will be made and lost. Which means you too should attempt to become a data scientist.

The skills have become increasingly easy to acquire, and are getting easier by the week. But at the same time, cultural barriers to people self-classifying into the data scene are being erected.

Redefine yourself while you can. Let me know if you need any pickaxes.