Data Science, what is it and why do we need it?

Hidden Statistics

On more than one occasion I have been really upset to see «Data Science» used to cover up what was simply an application of Statistics. I have been just as upset to find that expressions such as Big Data, Artificial Intelligence or Machine Learning pushed Statistics out of the picture, even taking away funding opportunities, as I mentioned in a tweet (in Spanish).

But, as I asked in a mini survey, also on Twitter, what really is Data Science? Is it simply another name for Statistics? Or is it something else?

The option with the largest percentage of votes (43%) was «Statistics + Computer Science», followed by «Statistics» and «A new science» with 20% each. The tweet received all kinds of responses. Some people mentioned that the options were incomplete, and they were surely right; Twitter was perhaps not the best place for this type of survey. That is why I am writing here today. I would like to explain my point of view on this topic: not the impressions of an angry Anabel, but the ideas of my reflective side, trying to understand how we have come this far, why I am outraged, and whether a new science like this is needed. But for you to understand the reason for my anger, let me give you some context.

Since I finished my degrees (first in Maths and then in Statistics) and during my PhD, I have come across a lot of great news about progress in one science or another. I especially remember a segment on a radio show dedicated to stents (devices that open up clogged arteries). In the interview, with each new answer, I saw statistics, tests, samples, significant differences… Everything that had been shown could be wrong if the statistics were not done well, if the sample size was not adequate or if the method used was not appropriate for the type of data. However, the discipline was not mentioned even once; I even doubt whether there was a statistical expert on the research team.

But the stent example was just a drop in the ocean. Any advance in science in which results cannot be demonstrated deterministically, and in which there is some degree of uncertainty, undoubtedly needs to work hand in hand with Statistics.

However, there is one moment when you can be sure to hear the word «Statistics»: when someone refers to a manipulated graph on television, or talks about mistakes in the calculation of unemployment or other rates. Statistics is definitely often associated with errors, lies or absurd simplifications such as «if I have eaten two chickens and you none, we have eaten one each on average, but you are still hungry». Please raise your hand if you are a statistician and have never been faced with the unfortunately famous phrase popularized by Mark Twain: «There are three kinds of lies: lies, damned lies, and statistics». No raised hands? I see…
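The chicken joke is really a complaint about reporting an average with nothing else next to it. As a minimal sketch, using only the two people in the joke (the variable names are made up for illustration), the same data look very different once a second number is reported alongside the mean:

```python
from statistics import mean

# Chickens eaten by each person in the joke: I ate two, you ate none.
chickens_eaten = [2, 0]

average = mean(chickens_eaten)                           # one chicken each "on average"
went_hungry = sum(1 for c in chickens_eaten if c == 0)   # people who ate nothing
spread = max(chickens_eaten) - min(chickens_eaten)       # range of the data

print(f"average per person: {average}")
print(f"people who went hungry: {went_hungry}")
print(f"range: {spread}")
```

The average is not a lie; it simply answers a different question from «did everybody eat?».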

Life changes

And while this was happening, the world was changing. The amount of stored data grew uncontrollably. In fact, by 2002 the amount of digitally stored information was already considered to be greater than the amount stored non-digitally, and the term Big Data began to be used (its origin is unclear).

In this situation, of course, we needed to reinvent ourselves and look for new techniques that would allow us to tackle the growing amount of data, and yes, the need for a «Data Science» then arose.

And when we talk about it, I can visualize the data on a Petri dish. We put dye on it (my apologies to the biologists), we make a cut here, another one there, we put it under a microscope and observe what is happening. After all, that is what Data Science is all about: finding ways to extract, clean, prepare and analyze data in order to reach consistent and accurate conclusions.

Data Science, as described by William S. Cleveland (1943-) in his article «Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics», must be multidisciplinary, made up of various sciences in which computing and mathematics allow us to face the challenges posed by Big Data. We must understand the term Big here in a broad sense, since it refers not only to a large amount of data but also to the complexity of the models needed to understand them. This multidisciplinarity is usually represented as a Venn diagram in which Data Science sits at the intersection of three sets: «The Three Legs».

Venn diagram with the «three legs» of Data Science

And this is the point at which my doubts and anger arise. On the one hand, can all these skills be brought together in one person (look at the unicorn)? On the other hand, is it all smoke and mirrors, with purely statistical applications being dressed up as Data Science?

Regarding my first question, perhaps the ideal is to create multidisciplinary teams (as I think was Cleveland’s idea). However, in order to lead such teams with a global vision of all aspects of data management, it seems reasonable to require specific training. This is, as I see it, what the new degrees in data science are trying to provide, some of them very successfully, by combining statistics, mathematics, computer science and notions from other areas such as law, medicine or biology.

But, unfortunately, the second question is relevant and my answer is yes, there is a lot of smoke and mirrors. For instance, in the tweet with the survey, I pointed out an example I had heard in the news. It mentioned an «architect and data scientist» who was using «Data Science tools and geo-statistical techniques», and I am sorry, but these words sound hollow to me. The same goes for talking about supervised classification algorithms or Machine Learning when a logistic regression would give us just as much information about what is going on in the data. Using the «black box» of Artificial Intelligence without knowing what it is doing, besides being easy to sell, can serve to perpetuate the biases present in the data, as we have already seen on more than one occasion.
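To illustrate what I mean by a logistic regression telling us what is going on in the data, here is a minimal sketch in plain Python. The eight data points are invented purely for illustration, and `fit_logistic` is a hypothetical helper that fits the model by simple gradient descent (real statistical software would normally use a different fitting method and also report standard errors):

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient descent
    on the average negative log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y          # gradient with respect to the intercept
            g1 += (p - y) * x    # gradient with respect to the slope
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Made-up data: x is some exposure, y is whether the event occurred (1) or not (0).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)  # multiplicative change in the odds per unit increase of x

print(f"intercept: {b0:.3f}, slope: {b1:.3f}")
print(f"odds ratio per unit of x: {odds_ratio:.3f}")
```

The point of the sketch is that the model has two numbers, and each has a direct reading: the slope, once exponentiated, says how the odds of the event change when x increases by one unit. That is information about the data, not just a prediction.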

Concluding

In short, yes, Data Science is a must in the 21st century. Multidisciplinary training is needed because the problems now arising are much more complex than the ones we used to have. However, we must not, under any circumstances, forget the importance of handling uncertainty correctly. Leaving Statistics aside, writing articles about «What is your favorite Machine Learning algorithm» without acknowledging that the algorithm must be adapted to the data and to the question of interest, or even stating that «doing data science needs less mathematics than you think» is dangerous and irresponsible.

And that is why I will continue to spread the word about the importance of Statistics for our society and, in particular, within Data Science, and to call out smoke and mirrors whenever I detect them.

References

  • W. S. Cleveland. Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1):21–26, 2001. ISSN 03067734, 17515823. URL http://www.jstor.org/stable/1403527.
