The Risky Eclipse of Statisticians

If statisticians have historically been leaders of data, why was there a need for a brand new breed of data scientists?  While the world is exploding with bounties of valuable data, statisticians are strangely working quietly in the shadows. Statistics is the science of learning from data, so why aren’t statisticians reigning as kings of today’s Big Data revolution?
In 2009, when Google was still fine-tuning its PageRank algorithm, which is built on the statistical concept of Markov chains, Google’s Chief Economist Hal Varian declared statistician the sexiest job of the coming decade. We’re about halfway through, and it seems that Varian missed the target.

“Professional statisticians are milling at the back of the church, mesmerized by the gaudy spectacle of [Big Data] before them.” – David Walker, statistician, Aug 2013.  

Google Trends shows us that while the popularity of Big Data is thriving, statisticians’ popularity has been declining over the years. Back in 2010, predictive modeling and analytics website Kaggle proudly dangled Varian’s prediction as a carrot on its careers page to lure people to join its team. But today the quote has curiously vanished–no longer deemed worthy.
[Image: Google Trends chart]
Even more telling, statisticians are often left out of some of the biggest national discussions happening around Big Data today. For instance, UC Berkeley’s Terry Speed observes:

  • US National Science Foundation invited 100 experts to talk about Big Data in 2012. Total number of statisticians present? 0.
  • The US Department of Health and Human Services has a 17-person Big Data committee. Total number of statisticians? You guessed it…0.

Justin Strauss, co-founder at Storyhackers, who previously led data science programs in the healthcare industry, can attest to this more generally. He says he has “seen an underrepresentation” of statisticians at conferences and other events related to Big Data. But statistics is the foundation of understanding Big Data. This was supposed to be their decade–their time to shine in the limelight. So, what changed? As renowned statistician Gerry Hahn once said:

“This is a Golden Age of statistics, but not necessarily for statisticians.”

Instead of crowning statisticians king, the Big Data revolution borrowed the foundational elements of applied statistics, married them with computer science and birthed an entirely new heir: The Data Scientist. But this underrepresentation of statisticians puts the future of Big Data at risk. The accurate evaluation of data that comes from a strong foundation of statistics could be lost in the hype.

Why Didn’t Statisticians Own Big Data?

Plenty has been written about the recent rise of data scientists, but the application of data science in industry is hardly new. In the early 1900s, statistician William Gosset studied yeast for the Guinness Brewing Company and invented the t-distribution in the process. Statistician Kaiser Fung points out that one of the most notable examples of a business built upon statistical algorithms came decades before Google: Fair Isaac Company introduced the analytics of credit scoring in the 1950s. Not to mention that the US government has been performing census calculations with incredible precision for more than two hundred years.
There are three plausible reasons why statisticians aren’t leading Big Data today. First, computational statistics of Big Data never flourished in mainstream statistical sciences.

“The area of massive datasets, though currently of great interest to Computational statisticians and to many data analysts, has not yet become part of mainstream statistical science.” – Buja and Keller-McNulty

This quote was published in 1999. More than a decade later, it still hasn’t happened. Although early statisticians recognized and discussed Big Data, many of them were ignored. Speed points out that statisticians have published books and papers about the techniques of wrangling large datasets. But they collected dust, as evidenced by how few citations they earned. For instance:
[Image: screenshot of citation counts]
Second, statistics is a crucial part of data science, but it–alone–is insufficient for making sense of the exponentially growing amounts of messy data we are producing daily. That requires computational power that only today’s advanced technology can supply. In 2010, the world stored about 6 exabytes of data, a stat so incomprehensible that it’s borderline meaningless. For a frame of reference, if you converted all words ever spoken by humans into text, that’s about 5 exabytes! Here are some more quick Big Data stats:
[Infographic: Big Data stats]
Machine learning is deeply rooted in statistics, but few statisticians have the technical skills to manipulate a dataset of 10 billion data points in which each point has 10,000 dimensions. But it’s not that statisticians lack computational knowledge. It’s that the field of statistics simply wasn’t equipped with the computing power we have today. For instance, data scientist David Hardtke led the invention of the Bright Score, an algorithm that assesses your fit for a job, at Bright.com, a startup later acquired by LinkedIn. But he says none of these ideas are really new. Back when he first started in the space, he met a senior researcher at Recruit Holdings, a Japanese recruiting firm.

“He told me he’s really interested in what I’m doing because he tried to do the same thing in the 80s. He said, frankly, it was way too expensive back then. You had to buy these massive computers and it wasn’t cost effective,” Hardtke says. 

But now, we’re at this convergence of super cheap, high-speed computing that’s helping data scientists process powerful insights and find answers to questions that remained a mystery 20 years ago. With Big Data booming, pure statistics is fading into the background relative to the demand for data science.
Third, some statisticians simply have no interest in carrying out scientific methods for business-oriented data science. If you look at online discussions, pure statisticians often scoff at the hype surrounding the rise of data scientists in the industry. Some say it’s a buzzword with good marketing (here), others say it’s a made-up title (here) and some call them folks who sold out to shareholders (here).

Statisticians’ Absence Could Lead to Misuse of Data

Even without a prominent presence of statisticians, educational institutions are churning out entirely new curriculums devoted to the so-called “new” field of data science in just the last few years.  But when dealing with Big Data, someone on the team needs to have a strong grasp of statistics to avoid reaching inaccurate conclusions.

The hype about data scientists is undeniable. The WSJ reports that these jobs are so in demand that data scientists with two years of experience are earning between $200,000 and $300,000 annually. Data scientist was dubbed the sexiest job of the 21st century in 2012. Universities are having to turn away data science students because of the surge in popularity. In response, at least a dozen new data science bootcamps aim to prepare graduates for data science jobs. And universities across the nation are creating brand new courses and programs for data science or business analytics. Here’s a visualization thanks to Columbia Data Science:


But, as with any new curriculum, space is limited. This is where it gets risky. Ben Reddy, a PhD in statistics at Columbia University, finds that the foundation of statistics often takes a backseat to learning the technical tools of the trade in data science classes. And even if students are carrying out statistical models in classes, doing statistics doesn’t guarantee that you understand statistics. Since learning R or NumPy is usually the gateway to getting your hands on real-world data, understanding statistical analysis often seems less interesting by comparison.

“Anyone who can type <t.test(data)> into R (not to mention <lm()>, <knn()>, <gbm()>, etc.) can “do” statistics, even if they misuse those methods in ways that William Sealy Gosset wouldn’t approve on his booziest days at the Guinness brewery.” Reddy writes. 
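
To make Reddy's point concrete, here is a minimal sketch in Python (standard library only) of what a one-sample t-test computes, and how easily it can be misused. The function name and the measurements are made up for illustration and are not from Reddy's post. The misuse shown is treating copies of the same observations as if they were new, independent data.

```python
import math
import statistics


def one_sample_t(sample, mu0=0.0):
    """Return the one-sample t statistic for the null hypothesis mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - mu0) / (sd / math.sqrt(n))


# Hypothetical measurements, invented for this example.
measurements = [0.8, -0.3, 1.9, 0.4, 1.2]

# Honest analysis: 5 independent observations.
t_honest = one_sample_t(measurements)

# Misuse: 4 copies of each observation treated as 20 independent ones.
# No information was added, yet the statistic grows.
t_inflated = one_sample_t(measurements * 4)

print(f"t with 5 independent points:      {t_honest:.2f}")
print(f"t with the same points copied 4x: {t_inflated:.2f}")
```

The arithmetic runs without complaint in both cases, which is exactly the problem. With 4 degrees of freedom, the usual two-sided 5% cutoff is about 2.78, so the honest statistic falls short of it. The inflated statistic clears its cutoff of about 2.09 for 19 degrees of freedom, even though the underlying evidence is identical.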

The worst part is, you can usually get away with carrying out subpar analysis because it’s hard to judge the quality of a statistical analysis without examining it in detail, he adds. And, usually, there’s not enough transparency to do this in the real world. So, with the absence of statisticians in Big Data today, how well are the fundamentals of statistics carried over in this new data science boom? Most students haven’t even graduated from these brand new data science courses yet, so it remains to be seen.

But this risk of losing the fundamentals is largely why Hardtke, a physicist himself, is opposed to these new degree programs. He makes a compelling point: It’s better to have someone who’s really passionate about geology, physics or any other science because they’ll pick up the tools of data manipulation as part of a bigger mission.

“I’d rather have someone major to get some answer and learn the tools along the way rather than learn the tools as the terminal point,” Hardtke says.

But the Most Powerful Data Science Teams are Multidimensional

Folks outside of the space often don’t realize that the most astonishing achievements in data science weren’t accomplished by just one superstar, unicorn data scientist. When Hardtke was tasked with building a strong data science team at startup Bright.com several years ago, he couldn’t afford to recruit the best data scientists away from the likes of Google and Facebook. But he knew something most data scientist-crazed recruiters don’t understand: At its core, it’s all about learning how to ingest data using statistical methodology and computational techniques to find an answer.

Most scientific disciplines require this knowledge. So, he hired scientists across disciplines: physicist, mechanical engineer, statistician, astrophysicist–basically anyone who wasn’t a computer scientist or data scientist. The most successful, passionate data science teams in Silicon Valley comprise a combination of different scientific disciplines that look at one problem from unique angles. It’s the only way to work through seemingly impossible problems in data science.

If you ask Nitin Sharma, for instance, about his data science team in the early days of Google, his eyes instantly light up. With experts from psychology, computer science, statistics and several other disciplines, Sharma’s diverse team offered perspectives from every dimension possible. Google’s head of search Amit Singhal once asked him: “How do you know if people are happy with the search results?” Simply tracking clicks on links can’t determine whether or not searchers were happy with their results. And so, the challenge was on for Sharma’s team.

“I can’t tell you the details of what Google did, but conceptually, we looked at what sequence do these clicks have? How much time they’re spending? How often do they refine queries? How often do they click on results? How many results? How does it depend on the type of query?” Sharma says. 

And, ultimately, Sharma’s team was able to work together to find a successful plan to monitor a user’s happiness, which offered deeper insight into search behavior and satisfaction with search results. While both data science and statistics share a common goal of extracting meaningful insight from data, the evolution of data science in the last 10 years emphasizes a demand for a combination of interdisciplinary skills.
Data science is making statistics–alone–irrelevant in industry, and in doing so it is eclipsing statisticians, the fathers of data science.

On a scale of 1-10, Sharma says we’ve only inched maybe 1-2 in terms of progress in data science. With the forthcoming revolution of the Internet of Things, there are infinite possibilities before us. The biggest challenge will be: How do we process and understand this insurmountable amount of data? The onus can’t be on “rockstar, unicorn” data scientists alone. And it can’t fall onto statisticians either. Although the demand for pure statistics will shrink relative to data science over time, it’s going to be more important than ever to have interdisciplinary knowledge from a variety of fields. And to ensure quality and foundational understanding of applied statistics, it’s crucial to save a seat for statisticians at the Big Data table.

Have you noticed an underrepresentation of statisticians in Big Data? Tell us what you think in the comments below!


Comments (24)

  • No, ‘data science’ is statistics done
    badly by computer programmers who
    know next to nothing about statistics.
    E.g., explain ‘sufficient’ statistics. How
    does this apply to order statistics. The
    Gaussian? The exponential family?
    The Radon-Nikodym theorem?
    How to do hypothesis tests? Distribution-
    free? Multi-dimensional? Data scientists
    have a lot of multi-dimensional data where
    knowing a distribution is not reasonable.
    So, how to do multi-dimensional, distribution-
    free hypothesis testing?
    What does the classic Neyman-Pearson
    result have to do with hypothesis testing?
    What is the strong law of large numbers?
    So, the computer programmers with the
    data decided to call themselves ‘data
    scientists’. It’s just statistics done badly
    by computer programmers.
    Why did it catch on? Who among the
    customers, managers, CEOs, knows
    the difference? Besides, what such
    person really wants someone around
    who knows more than they do, except
    maybe at hacking Python code?
    That the data is ‘big’ is next to irrelevant:
    They extract just the data they need and
    work with that; they don’t do much with
    all the data. Besides, statistics is about
    making estimates, and usually 50,000
    samples are quite sufficient and don’t really
    need 50 trillion.
    And it’s not just statistics that’s been set
    aside. Also on the shelves of the research
    libraries is a lot more in powerful, relevant
    applied math in probability (not the same
    as statistics), stochastic processes,
    optimization, control, etc. The computer
    science profs flounder around with these
    topics terribly. The lesson is: The customers
    want work that is less good, not better,
    but want a lot of hype. The hype won’t
    last very long. Data science will go the way
    of talking teddy bears. It’s a fad.
    For the past 70 years or so, these fields
    did good work because they had a good
    customer — US national security and the
    space race. But, a market with really
    bad customers stands to have really bad
    products, and so it is with data science.

    • sigmaalgebra
    • July 20, 2015 at 8:14 pm
    • Your data and post are biased.

      • JP Chastain
      • July 22, 2015 at 12:04 am
    • Worth reading just for this comment, much more insightful than the article itself.

    • Good points. However, your generalization to all Computer scientists forgets the Machine Learning community, which are predominantly CS people who are very adept with probability/statistics/optimization (and the seminal topics you listed above). The primary (nuanced) difference is they tend to emphasize algorithmic & computational aspects of data problems (i.e. aspects that affect performance in practice) rather than focusing on asymptotic theory and oft-unrealistic (but mathematically nice) generative models. Leo Breiman aptly described this rift: https://projecteuclid.org/euclid.ss/1009213726

      • Jonas
      • July 27, 2015 at 3:05 pm
      • Yes, sure, Breiman knew what the heck was going on, e.g., classification and regression trees and the piece you referenced — without reading it now I believe I read it long ago.
        Still looking at machine learning, I could never find anything but largely intuitive ‘fitting’ to data. That goal is not wrong, but I could never find anything that looked like significant progress
        on it.
        Theorems and proofs are still where the real power is.

        • sigmaalgebra
        • July 27, 2015 at 3:13 pm
        • Of course, theory is important & there is much rich theory in ML.
          However, it is simply of a different flavor, e.g. the literature on
          PAC learning, VC dimension, boosting, computational complexity,
          characteristics of (stochastic) optimization methods, causality,
          papers at the Conference on Learning Theory.
          Statistics, being the first field to deal with data,
          has naturally developed the most fundamental theory in this domain,
          but that does not mean the later theory developed in CS is irrelevant.

          • Jonas
          • July 27, 2015 at 3:33 pm
          • Clearly you found some better material in machine learning than I did.
            I should avoid mentioning the sources I didn’t like!

            • sigmaalgebra
            • July 27, 2015 at 6:00 pm
  • “Big Data” is nothing but the politicization of statistics, which explains why you don’t see true statisticians doing it.

  • Big Data is also the name of a recently popular band, who released their first album in October 2013. Not sure how much that skews the GA keyword analysis.

    • Scott Trenda
    • July 21, 2015 at 1:51 pm
  • The chart at the top is also misleading. Big data is a concept, data science is a field, but statistician is a profession. If you trend statistics, search results are much higher than both big data and data science.

    • Howard Gross
    • July 21, 2015 at 2:09 pm
  • I learned nothing by reading this article.

    • ChasStevenson
    • July 21, 2015 at 4:43 pm
  • Everyone I know who is a data scientist is classically trained as either a statistician, econometrician or mathematician. Moreover, they’re all using the same math which is solidly grounded within the ambit of statistics.

    • jfinsterwald
    • July 21, 2015 at 4:57 pm
  • I’m a statistician, and a psychologist. I chose the term Cyberpsychologist because no one hires a statistician for their advertising and SEO work, and the trends data uses the exact word “statistician” when there are a variety of math-minded folk who identify under other job titles such as conversion scientist, or optimization specialist. Semantics peeps.

    • JP Chastain
    • July 22, 2015 at 12:03 am
  • “data scientist is just a “sexed up” term for statistician”
    -Nate Silver

    • jackbr5820
    • July 22, 2015 at 6:58 pm
  • This blog repeats some of the unfounded myths we have seen on the internet for the past two years. I am writing a series on Statistics Denial to address them. The series is anchored by an article in Analytics Magazine: http://goo.gl/Wod3gk. Datafloq has been carrying the complete series so far: https://datafloq.com/read/author/randy-bartlett/279. Your comments are welcome.

    • Randeroid
    • July 23, 2015 at 6:17 am
  • Warning: my response is going to sound like a self-promotion,
    because it is!
    I totally agree with the article but, for me, the principal
    question is why the analytical community has been taking a back seat to data
    science. My opinion, FWIW, is that we’ve always been difficult to identify and,
    typically, most of us have gotten our jobs by either knowing the right people
    or happening to be at the right place at the right time.
    What we’ve never had is someone or something we could trust to
    capture our skills and experience in a single database of analytical
    professionals where we, ourselves, could keep our confidentiality
    yet give all companies and recruiters a way to find us by matching our
    credentials with what they happen to be seeking.
    We’re all not statisticians, per se, but have all been
    trained as such. I happen to be a Psychologist but, throughout my career, have
    worked as an analytical professional.
    When companies and recruiters are seeking those unicorns (or
    as close as they can find), we are more often than not more trained and
    experienced than what they end up having to accept.
    You can read about my proposed solution, company and service
    at: https://www.linkedin.com/pulse/married-looking-arthur-tabachneck?trk=pulse_spock-articles
    Art, CEO, AnalystFinder.com

    • Arthur Tabachneck
    • July 24, 2015 at 9:55 pm
  • Big Data which explains why you don’t see true statisticians doing it and Although the requirement for pure statistics will increase relative to data science and over time, it’s going to be more important than ever to have interdisciplinary knowledge from a various of fields. And to ensure quality and Basic understanding of applied statistics, it’s important to preserve a seat for statisticians at the Big Data table. The primary (nuanced) difference is they tend to emphasize algorithmic & computational aspects of data problems (i.e. aspects that affect performance in practice) rather than focusing on asymptotic theory and oft-unrealistic (but mathematically nice) generative models
    Thank you for sharing information

  • For good data analytics

  • big data analytics companies
    execute a proof of concept around the primary business use case (PoC). PoC should aim at combining internal data from data warehouses, log files and transactional systems such as ERP, CRM, log files with external data from social media, benchmarks or third party data
