If statisticians have historically been the masters of data, why was there a need for a brand new breed of data scientists? While the world is exploding with bounties of valuable data, statisticians are strangely working quietly in the shadows. Statistics is the science of learning from data, so why aren't statisticians reigning as kings of today's Big Data revolution?
In 2009, when Google was still fine-tuning its PageRank algorithm, built on the statistical innovation of Markov chains, Google's Chief Economist Hal Varian declared statistician the sexiest job of the coming decade. We're about halfway through that decade, and it seems that Varian missed the target.
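The Markov-chain idea behind PageRank is worth a brief illustration. The sketch below is a toy version, not Google's production algorithm: it treats a tiny, invented four-page web as a Markov chain and finds each page's rank as the chain's stationary distribution via power iteration.

```python
import numpy as np

# Hypothetical four-page web: links[i][j] = 1 if page i links to page j.
links = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

# Row-normalize into a Markov transition matrix, then mix in a damping
# factor so the chain is irreducible (the "random surfer" teleport step).
d = 0.85
P = links / links.sum(axis=1, keepdims=True)
G = d * P + (1 - d) / len(links)

# PageRank is the stationary distribution of this chain: start uniform
# and iterate until the rank vector stops changing (power iteration).
rank = np.full(len(links), 1 / len(links))
for _ in range(100):
    rank = rank @ G

print(rank.round(3))  # page 2, with the most in-links, ranks highest
```

The statistical core is exactly what a Markov-chain course teaches: the long-run fraction of time a random walk spends at each state.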
“Professional statisticians are milling at the back of the church, mesmerized by the gaudy spectacle of [Big Data] before them.” – David Walker, statistician, Aug 2013.
Google Trends shows that while the popularity of Big Data is thriving, interest in statisticians has been declining for years. Back in 2010, the predictive modeling and analytics website Kaggle proudly dangled Varian's prediction as a carrot on its careers page to lure people to join its team. But today the quote has curiously vanished, no longer deemed worthy.
What speaks even louder is that statisticians are often left out of some of the biggest national discussions happening around Big Data today. UC Berkeley's Terry Speed, for instance, has observed as much.
Justin Strauss, co-founder at Storyhackers, who previously led data science programs in the healthcare industry, can attest to this more generally. He says he has "seen an underrepresentation" of statisticians at conferences and other events related to Big Data. But statistics is the foundation of understanding Big Data. This was supposed to be their decade, their time to shine in the limelight. So, what changed? As renowned statistician Gerry Hahn once said:
“This is a Golden Age of statistics, but not necessarily for statisticians.”
Instead of crowning statisticians king, the Big Data revolution borrowed the foundational elements of applied statistics, married them with computer science and birthed an entirely new heir: the data scientist. But this underrepresentation of statisticians puts the future of Big Data at risk. The accurate evaluation of data that comes from a strong foundation of statistics could be lost in the hype.
Plenty has been written about the recent rise of data scientists, but the application of data science to industry is old. In the early 1900s, statistician William Gosset studied yeast for the Guinness brewery and invented the t-distribution in the process. Statistician Kaiser Fung points out that one of the most notable examples of a business built upon statistical algorithms came decades before Google: Fair Isaac Company introduced the analytics of credit scoring in the 1950s. Not to mention that the US government has been performing census calculations with incredible precision for more than two centuries.
There are three plausible reasons why statisticians aren't leading Big Data today. First, the computational statistics needed for Big Data never flourished in mainstream statistical science.
“The area of massive datasets, though currently of great interest to Computational statisticians and to many data analysts, has not yet become part of mainstream statistical science.” – Buja and Keller-McNulty
That quote was published in 1999, and a decade later the shift still hadn't happened. Although early statisticians recognized and discussed Big Data, many of them were ignored. Speed points out that statisticians published books and papers on techniques for wrangling large datasets, but they collected dust, as evidenced by the handful of citations they earned.
Second, statistics is a crucial part of data science, but statistics alone is insufficient for making sense of the exponentially growing amounts of messy data we produce daily. Doing so requires computational power that only today's technology can supply. In 2010, the world stored about 6 exabytes of data, a number so incomprehensible that it's borderline meaningless. For a frame of reference: if you converted all words ever spoken by humans into text, that's about 5 exabytes.
Machine learning is deeply rooted in statistics, but few statisticians have the technical skills to manipulate a dataset of 10 billion points in which each point has 10,000 dimensions. It's not that statisticians lack computational knowledge; it's that the field of statistics simply wasn't equipped with the computing power we have today. For instance, data scientist David Hardtke led the invention of the Bright Score, an algorithm that assesses your fit for a job, whose parent company was acquired by LinkedIn. But he says none of these ideas are really new. Back when he first started in the space, he met a senior researcher at Recruit Holdings, a Japanese recruiting firm.
“He told me he’s really interested in what I’m doing because he tried to do the same thing in the 80s. He said, frankly, it was way too expensive back then. You had to buy these massive computers and it wasn’t cost effective,” Hardtke says.
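Back-of-the-envelope arithmetic shows why that kind of dataset was out of reach until recently. Assuming the 10-billion-point, 10,000-dimension dataset above is stored as dense 8-byte floats (an assumption purely for illustration), the raw size alone is staggering:

```python
# Raw storage for the dataset described above, assuming dense 8-byte floats.
points = 10_000_000_000   # 10 billion data points
dims = 10_000             # dimensions per point
bytes_total = points * dims * 8

print(bytes_total / 10**12, "TB")  # 800.0 TB of raw data
```

Eight hundred terabytes before you compute a single statistic: on the hardware of the 1980s that Hardtke's acquaintance faced, even a sliver of this was prohibitively expensive.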
But now we're at a convergence of super-cheap, high-speed computing that's helping data scientists extract powerful insights and find answers to questions that remained mysteries 20 years ago. With Big Data booming, pure statistics is fading into the background relative to the demand for data science.
Third, some statisticians simply have no interest in carrying out scientific methods for business-oriented data science. If you look at online discussions, pure statisticians often scoff at the hype surrounding the rise of data scientists in industry. Some say it's a buzzword with good marketing, others say it's a made-up title, and some call data scientists folks who sold out to shareholders.
Even without a prominent presence of statisticians, educational institutions have churned out entirely new curricula devoted to the so-called "new" field of data science in just the last few years. But when dealing with Big Data, someone on the team needs a strong grasp of statistics to avoid reaching inaccurate conclusions.
The elevated hype about data scientists is undeniable. The WSJ reports that these jobs are so in demand that data scientists with two years of experience are earning between $200,000 and $300,000 annually. Data scientist was dubbed the sexiest job of the 21st century in 2012. Universities are having to turn away data science students because of the surge in popularity. As a result, there are at least a dozen new data science bootcamps that aim to prepare graduates for data science jobs, and universities across the nation are creating brand new courses and programs in data science or business analytics, as catalogued in a visualization by Columbia Data Science.
But, as with any new curriculum, space is limited. This is where it gets risky. Ben Reddy, a PhD in statistics at Columbia University, finds that the foundations of statistics often take a backseat to learning the technical tools of the trade in data science classes. And even if students are running statistical models in class, doing statistics doesn't guarantee that you understand statistics. Since learning R or NumPy is usually the gateway to getting your hands on real-world data, the underlying statistical analysis often seems less interesting by comparison.
“Anyone who can type t.test(data) into R (not to mention lm(), knn(), gbm(), etc.) can “do” statistics, even if they misuse those methods in ways that William Sealy Gosset wouldn’t approve on his booziest days at the Guinness brewery,” Reddy writes.
The worst part is that you can usually get away with subpar analysis, because it's hard to judge the quality of statistics without examining the analysis in detail, he adds. And usually there isn't enough transparency in the real world to do that. So, with statisticians largely absent from Big Data today, how well are the fundamentals of statistics being carried into this new data science boom? Most students haven't even graduated from these brand new data science courses yet, so it remains to be seen.
But this risk of losing the fundamentals is largely why Hardtke, a physicist himself, is opposed to these new degree programs. He makes a compelling point: it's better to have someone who's really passionate about geology, physics or any other science, because they'll pick up the tools of data manipulation as part of a bigger mission.
“I’d rather have someone major to get some answer and learn the tools along the way rather than learn the tools as the terminal point,” Hardtke says.
Folks outside of the space often don't realize that the most astonishing achievements in data science weren't accomplished by just one superstar, unicorn data scientist. When Hardtke was tasked with building a strong data science team at startup Bright.com several years ago, he couldn't afford to recruit the best data scientists away from the likes of Google and Facebook. But he knew something most data scientist-crazed recruiters don't understand: at its core, the job is about learning to ingest data using statistical methodology and computational techniques to find an answer.
Most scientific disciplines require this knowledge. So he hired scientists across disciplines: a physicist, a mechanical engineer, a statistician, an astrophysicist, basically anyone who wasn't a computer scientist or data scientist. The most successful, passionate data science teams in Silicon Valley comprise a combination of different scientific disciplines that look at one problem from unique angles. It's the only way to work through seemingly impossible problems in data science.
If you ask Nitin Sharma, for instance, about his data science team from the early days of Google, his eyes instantly light up. With experts from psychology, computer science, statistics and several other disciplines, Sharma's diverse team offered perspectives from every possible dimension. Google's head of search, Amit Singhal, once asked him: "How do you know if people are happy with the search results?" Simply counting clicks on links can't determine whether the searcher was happy with the result. And so the challenge was on for Sharma's team.
“I can’t tell you the details of what Google did, but conceptually, we looked at what sequence do these clicks have? How much time they’re spending? How often do they refine queries? How often do they click on results? How many results? How does it depend on the type of query?” Sharma says.
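Google's actual method is confidential, as Sharma says, but the kinds of session-level signals he describes conceptually can be sketched from a toy click log. Everything below is hypothetical: the log format, field names, and the dwell-time proxy are invented for illustration, not drawn from any real system.

```python
# Hypothetical search-session log: (time "MM:SS", event type, detail).
# This illustrates the signals Sharma lists: click sequence, time spent,
# query refinements, and click counts. It reflects no real data model.
session = [
    ("00:00", "query", "cheap flights"),
    ("00:05", "click", "result_3"),
    ("00:40", "query", "cheap flights to tokyo"),  # a query refinement
    ("00:45", "click", "result_1"),
    ("04:45", "end", None),                        # long dwell afterward
]

def seconds(ts):
    """Convert an 'MM:SS' timestamp into seconds from session start."""
    m, s = ts.split(":")
    return int(m) * 60 + int(s)

clicks = [e for e in session if e[1] == "click"]
refinements = sum(1 for e in session if e[1] == "query") - 1

# Dwell time on the final click: a common (imperfect) proxy for
# satisfaction, since a happy searcher stops searching.
dwell = seconds(session[-1][0]) - seconds(clicks[-1][0])

print({"clicks": len(clicks), "refinements": refinements,
       "dwell_seconds": dwell})
```

Even this toy shows why the problem needed a multidisciplinary team: deciding that a long dwell means "happy" is a behavioral-psychology question, validating it is a statistics question, and computing it at Google's scale is an engineering question.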
And, ultimately, Sharma's team worked together to find a successful way to monitor a user's happiness, which offered deeper insight into search behavior and satisfaction with search results. While data science and statistics share a common goal of extracting meaningful insight from data, the evolution of data science over the last 10 years underscores the demand for a combination of interdisciplinary skills.
Data science is making statistics, on its own, irrelevant in industry, and in the process it is eclipsing statisticians, the very fathers of data science.
On a scale of 1 to 10, Sharma says we've inched to maybe a 1 or 2 in terms of progress in data science. With the forthcoming Internet of Things revolution, there are infinite possibilities before us. The biggest challenge will be: how do we process and understand this insurmountable flood of data? The onus can't fall on "rockstar, unicorn" data scientists alone, and it can't fall on statisticians either. Although the demand for pure statistics will shrink relative to data science over time, interdisciplinary knowledge from a variety of fields is going to be more important than ever. And to ensure quality and a foundational understanding of applied statistics, it's crucial to save a seat for statisticians at the Big Data table.
Have you noticed an underrepresentation of statisticians in Big Data? Tell us what you think in the comments below!