
If statisticians have historically been leaders of data, why was there a need for a brand new breed of data scientists? While the world is exploding with bounties of valuable data, statisticians are strangely working quietly in the shadows. Statistics is the science of learning from data, so why aren’t statisticians reigning as kings of today’s Big Data revolution?

In 2009, when Google was still fine-tuning its PageRank algorithm, which is built on the statistical innovation of Markov chains, Google’s Chief Economist Hal Varian declared statistician the sexiest job of the decade. We’re about halfway through, and it seems that Varian missed the target.

“Professional statisticians are milling at the back of the church, mesmerized by the gaudy spectacle of [Big Data] before them.” –David Walker, statistician, Aug 2013.

Google Trends shows us that while the popularity of Big Data is thriving, statisticians’ popularity has been declining over the years. Back in 2010, predictive modeling and analytics website Kaggle proudly dangled Varian’s prediction as a carrot on its careers page to lure people to join its team. But today the quote has curiously vanished, no longer deemed worthy.

What speaks even louder is that statisticians are often left out of some of the biggest national discussions happening around Big Data today. For instance, UC Berkeley’s Terry Speed observes:

- The US National Science Foundation invited 100 experts to talk about Big Data in 2012. Total number of statisticians present? 0.
- The US Department of Health and Human Services has a 17-person Big Data committee. Total number of statisticians? You guessed it…0.

Justin Strauss, co-founder at Storyhackers, who previously led data science programs in the healthcare industry, can attest to this more generally. He says he has “seen an underrepresentation” of statisticians at conferences and other events related to Big Data. But statistics is the foundation of understanding Big Data. This was supposed to be their decade–their time to shine in the limelight. So, what changed? As renowned statistician Gerry Hahn once said:

“This is a Golden Age of statistics, but not necessarily for statisticians.”

Instead of crowning statisticians king, the Big Data revolution borrowed the foundational elements of applied statistics, married them with computer science and birthed an entirely new heir: The Data Scientist. But this underrepresentation of statisticians puts the future of Big Data at risk. The accurate evaluation of data that comes from a strong foundation of statistics could be lost in the hype.

Plenty has been written about the recent rise of data scientists, but the application of data science to industry is ancient. In the early 1900s, statistician William Gosset studied yeast for the Guinness Brewing Company and invented the *t*-distribution in the process. Statistician Kaiser Fung points out that one of the most notable examples of a business built upon statistical algorithms came decades before Google. Fair Isaac Company introduced the analytics of credit scoring in the 1950s. Not to mention that the US government has been performing census calculations with incredible precision for hundreds of years as well.

There are three plausible reasons why statisticians aren’t leading Big Data today. First, computational statistics of Big Data never flourished in mainstream statistical sciences.

“The area of massive datasets, though currently of great interest to Computational statisticians and to many data analysts, has not yet become part of mainstream statistical science.” – A. Buja and S. Keller-McNulty

This quote was published in 1999. More than a decade later, it still hasn’t happened. Although early statisticians recognized and discussed Big Data, many of them were ignored. Speed points out that statisticians have published books and papers about the techniques of wrangling large datasets. But they collected dust, as evidenced by the few citations they earned. For instance:

Second, statistics is a crucial part of data science, but on its own it is insufficient for making sense of the exponentially growing amounts of messy data we produce daily. That requires computational power that only today’s advanced technology can supply. In 2010, the world stored about 6 exabytes of data, a stat so incomprehensible that it’s borderline meaningless. For a frame of reference, if you converted all words ever spoken by humans into text, that’s about 5 exabytes! Here are some more quick Big Data stats:

Machine learning is deeply rooted in statistics, but few statisticians have the technical skills to manipulate a dataset of 10 billion data points, each with 10,000 dimensions. It’s not that statisticians lack computational knowledge. It’s that the field of statistics simply wasn’t equipped with the computing power we have today. For instance, data scientist David Hardtke led the invention of the Bright Score, an algorithm that assesses your fit for a job, at Bright.com, which was later acquired by LinkedIn. But he says none of these ideas are really *new*. Back when he first started in the space, he met a senior researcher at Recruit Holdings, a Japanese recruiting firm.

“He told me he’s really interested in what I’m doing because he tried to do the same thing in the 80s. He said, frankly, it was way too expensive back then. You had to buy these massive computers and it wasn’t cost effective,” Hardtke says.

But now, we’re at this convergence of super cheap, high-speed computing that’s helping data scientists process powerful insights and find answers to questions that remained a mystery 20 years ago. With Big Data booming, pure statistics is fading into the background relative to the demand for data science.

Third, some statisticians simply have no interest in carrying out scientific methods for business-oriented data science. If you look at online discussions, pure statisticians often scoff at the hype surrounding the rise of data scientists in the industry. Some say it’s a buzzword with good marketing, others say it’s a made-up title and some call data scientists folks who sold out to shareholders.

Even without a prominent presence of statisticians, educational institutions have churned out entirely new curriculums devoted to the so-called “new” field of data science in just the last few years. But when dealing with Big Data, someone on the team needs to have a strong grasp of statistics to avoid reaching inaccurate conclusions.

The elevated hype about data scientists is undeniable. The WSJ reports that these jobs are so in demand that data scientists with two years of experience are earning between $200,000 and $300,000 annually. In 2012, it was dubbed the sexiest job of the 21st century. Universities are having to turn down data science students because of the surge in popularity. As a result, there are at least a dozen new data science bootcamps that aim to prepare graduates for data science jobs. And universities across the nation are creating brand new courses and programs for data science or business analytics. Here’s a visualization thanks to Columbia Data Science:

But, as with any new curriculum, space is limited. This is where it gets risky. Ben Reddy, a PhD in statistics at Columbia University, finds that the foundation of statistics often takes a backseat to learning the technical tools of the trade in data science classes. And even if students are carrying out statistical models in classes, doing statistics doesn’t guarantee that you understand statistics. Since learning R or NumPy is usually the gateway to getting your hands on real-world data, understanding statistical analysis often seems less interesting by comparison.

“Anyone who can type `t.test(data)` into R (not to mention `lm()`, `knn()`, `gbm()`, etc.) can “do” statistics, even if they misuse those methods in ways that William Sealy Gosset wouldn’t approve on his booziest days at the Guinness brewery,” Reddy writes.
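
To make Reddy’s point concrete, here is a minimal sketch in Python of roughly what a one-sample *t*-test computes behind that one-liner. The function name `one_sample_t` and the sample values are made up for illustration, and the sketch assumes independent, roughly normal observations. The arithmetic runs happily whether or not those assumptions hold, which is exactly the trap.

```python
import math
import statistics


def one_sample_t(data, mu0=0.0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test.

    Hypothetical helper for illustration. It assumes the observations
    are independent and roughly normal, but nothing here checks that.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all observations are identical; t is undefined")
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1


# Made-up sample, tested against a hypothesized mean of 0.
t_stat, dof = one_sample_t([1, 2, 3, 4, 5], mu0=0.0)
print(f"t = {t_stat:.3f} on {dof} degrees of freedom")
```

Nothing in the calculation flags correlated, heavily skewed or cherry-picked data. Judging whether the test was appropriate in the first place is the part that takes statistical training.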

The worst part is, you can usually get away with carrying out subpar analysis because it’s hard to judge the quality of statistics without examining the analysis in detail, he adds. And, usually, there’s not enough transparency to do this in the real world. So, with the absence of statisticians in Big Data today, how well are the fundamentals of statistics carried over in this new data science boom? Most students haven’t even graduated from these brand new data science courses yet, so it remains to be seen.

But this risk of losing the fundamentals is largely why Hardtke, a physicist himself, is opposed to these new degree programs. He makes a compelling point: It’s better to have someone who’s really passionate about geology, physics or any other science because they’ll pick up the tools of data manipulation as part of a bigger mission.

“I’d rather have someone major to get some answer and learn the tools along the way rather than learn the tools as the terminal point,” Hardtke says.

Folks outside of the space often don’t realize that the most astonishing achievements in data science weren’t accomplished by just one superstar, unicorn data scientist. When Hardtke was tasked with building a strong data science team at startup Bright.com several years ago, he couldn’t afford to recruit the best data scientists away from the likes of Google and Facebook. But he knew something most data scientist-crazed recruiters don’t understand: At its core, it’s all about learning how to ingest data using statistical methodology and computational techniques to find an answer.

Most scientific disciplines require this knowledge. So, he hired scientists across disciplines: a physicist, a mechanical engineer, a statistician, an astrophysicist, basically anyone who wasn’t a computer scientist or data scientist. The most successful, passionate data science teams in Silicon Valley comprise a combination of different scientific disciplines that look at one problem from unique angles. It’s the only way to work through seemingly impossible problems in data science.

If you ask Nitin Sharma, for instance, about his data science team in the early days of Google, his eyes instantly light up. With experts from psychology, computer science, statistics and several other disciplines, Sharma’s diverse team offered perspectives from every dimension possible. Google’s head of search Amit Singhal once asked him: “How do you know if people are happy with the search results?” Simply tracking clicks on links can’t determine whether or not the searcher was happy with the result. And so, the challenge was on for Sharma’s team.

“I can’t tell you the details of what Google did, but conceptually, we looked at what sequence do these clicks have? How much time they’re spending? How often do they refine queries? How often do they click on results? How many results? How does it depend on the type of query?” Sharma says.

And, ultimately, Sharma’s team was able to work together to find a successful plan to monitor a user’s happiness, which offered deeper insight into search behavior and satisfaction with search results. While both data science and statistics share a common goal of extracting meaningful insight from data, the evolution of data science in the last 10 years emphasizes a demand for a combination of interdisciplinary skills.

Data science is making statistics, on its own, irrelevant in industry, and in the process it is eclipsing statisticians, the fathers of data science.

On a scale of 1-10, Sharma says we’ve only inched maybe 1-2 in terms of progress in data science. With the forthcoming revolution of the Internet of Things, there are infinite possibilities before us. The biggest challenge will be: How do we process and understand this insurmountable data? The onus can’t be on “rockstar, unicorn” data scientists alone. And it can’t fall onto statisticians either. Although the demand for pure statistics will shrink relative to data science over time, it’s going to be more important than ever to have interdisciplinary knowledge from a variety of fields. And to ensure quality and foundational understanding of applied statistics, it’s crucial to save a seat for statisticians at the Big Data table.

**Have you noticed an underrepresentation of statisticians in Big Data? Tell us what you think in the comments below!**

No, ‘data science’ is statistics done badly by computer programmers who know next to nothing about statistics.

E.g., explain ‘sufficient’ statistics. How does this apply to order statistics? The Gaussian? The exponential family? The Radon-Nikodym theorem?

How to do hypothesis tests? Distribution-free? Multi-dimensional? Data scientists have a lot of multi-dimensional data where knowing a distribution is not reasonable. So, how to do multi-dimensional, distribution-free hypothesis testing?

What does the classic Neyman-Pearson result have to do with hypothesis testing?

What is the strong law of large numbers?

So, the computer programmers with the data decided to call themselves ‘data scientists’. It’s just statistics done badly by computer programmers.

Why did it catch on? Who among the customers, managers, CEOs, knows the difference? Besides, what such person really wants someone around who knows more than they do, except maybe at hacking Python code?

That the data is ‘big’ is next to irrelevant: They extract just the data they need and work with that; they don’t do much with all the data. Besides, statistics is about making estimates, and usually 50,000 samples are quite sufficient and don’t really need 50 trillion.

And it’s not just statistics that’s been set aside. Also on the shelves of the research libraries is a lot more in powerful, relevant applied math in probability (not the same as statistics), stochastic processes, optimization, control, etc. The computer science profs flounder around with these topics terribly. The lesson is: The customers want work that is less good, not better, but want a lot of hype. The hype won’t last very long. Data science will go the way of talking teddy bears. It’s a fad.

For the past 70 years or so, these fields did good work because they had a good customer: US national security and the space race. But, a market with really bad customers stands to have really bad products, and so it is with data science.

Your data and post are biased.

Worth reading just for this comment, much more insightful than the article itself.

Good points. However, your generalization to all Computer scientists forgets the Machine Learning community, which are predominantly CS people who are very adept with probability/statistics/optimization (and the seminal topics you listed above). The primary (nuanced) difference is they tend to emphasize algorithmic & computational aspects of data problems (i.e. aspects that affect performance in practice) rather than focusing on asymptotic theory and oft-unrealistic (but mathematically nice) generative models. Leo Breiman aptly described this rift: https://projecteuclid.org/euclid.ss/1009213726

Yes, sure, Breiman knew what the heck was going on, e.g., classification and regression trees and the piece you referenced — without reading it now I believe I read it long ago.

Still looking at machine learning, I could never find anything but largely intuitive ‘fitting’ to data. That goal is not wrong, but I could never find anything that looked like significant progress on it.

Theorems and proofs are still where the real power is.

Of course, theory is important & there is much rich theory in ML. However, it is simply of a different flavor, e.g. the literature on PAC learning, VC dimension, boosting, computational complexity, characteristics of (stochastic) optimization methods, causality, papers at the Conference on Learning Theory. Statistics, being the first field to deal with data, has naturally developed the most fundamental theory in this domain, but that does not mean the later theory developed in CS is irrelevant.

Clearly you found some better material in machine learning than I did.

I should avoid mentioning the sources I didn’t like!

“Big Data” is nothing but the politicization of statistics, which explains why you don’t see true statisticians doing it.

Big Data is also the name of a recently popular band, who released their first album in October 2013. Not sure how much that skews the GA keyword analysis.

The chart at the top is also misleading. Big data is a concept, data science is a field, but statistician is a profession. If you trend statistics, search results are much higher than both big data and data science.

I learned nothing by reading this article.

Everyone I know who is a data scientist is classically trained as either a statistician, econometrician or mathematician. Moreover, they’re all using the same math which is solidly grounded within the ambit of statistics.

I’m a statistician, and a psychologist. I chose the term Cyberpsychologist because no one hires a statistician for their advertising and SEO work, and the trends data uses the exact word “statistician” when there are a variety of math-minded folk who identify under other job titles such as conversion scientist, or optimization specialist. Semantics peeps.

“data scientist is just a “sexed up” term for statistician”

-Nate Silver

This blog repeats some of the unfounded myths we have seen on the internet for the past two years. I am writing a series on Statistics Denial to address them. The series is anchored by an article in Analytics Magazine: http://goo.gl/Wod3gk. Datafloq has been carrying the complete series so far: https://datafloq.com/read/author/randy-bartlett/279. Your comments are welcome.

Warning: my response is going to sound like a self-promotion, because it is!

I totally agree with the article but, for me, the principal question is why the analytical community has been taking a back seat to data science. My opinion, FWIW, is that we’ve always been difficult to identify and, typically, most of us have gotten our jobs by either knowing the right people or happening to be at the right place at the right time.

What we’ve never had is someone or something we could trust to capture our skills and experience in a single database of analytical professionals where we, ourselves, could keep our confidentiality, yet simultaneously give all companies and recruiters a way to find us by matching our credentials with those they happen to be seeking.

We’re not all statisticians, per se, but have all been trained as such. I happen to be a Psychologist but, throughout my career, have worked as an analytical professional.

When companies and recruiters are seeking those unicorns (or as close as they can find), we are more often than not more trained and experienced than what they end up having to accept.

You can read about my proposed solution, company and service at: https://www.linkedin.com/pulse/married-looking-arthur-tabachneck?trk=pulse_spock-articles

Art, CEO, AnalystFinder.com

