Data Science

Coup Data: the end of science?

What are the consequences of Big Data for the scientific method and for our understanding of reality? Data mining is not geared towards measuring reality through causal and deductive means the way traditional statistics is. It collects and streams all types of data and infers decisions by correlation. Profiling aims at understanding what might become, rather than what is. Assessments of what might happen used to be relegated to the domain of subjective belief, a domain now being discarded in the name of a digital truth.

The massive collection of data has created a new positivist paradigm, introduced in the 2009 publication “The Fourth Paradigm”, in which Microsoft researchers announced the end of scientific theory. According to them, there will be no need to hypothesize or to build models of the world: our data will speak for itself. Mass exploitation of data will reveal truths about the world that were previously immeasurable.

The “End of Theory” is an idea promoted largely by Chris Anderson, then editor-in-chief of the technology magazine Wired, who posits: “We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. [...] Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”

The Western tradition of seeking causality in phenomena is losing its influence. Big data imposes a purely inductive rationale. Correlation replaces causation. The scientific paradox of algorithms lies in their capacity to let us study new, complex phenomena that were previously unobservable, without really understanding their causes. Yet correlation is not equivalent to causation: just because events A and B co-occur does not mean that A is the cause of B, or vice versa.

To highlight the sometimes comical nature of correlation analysis, we can look at the work of Tyler Vigen, then a Harvard student, who dedicated a website to spurious correlations: absurd statistical matches that illustrate the gap between correlation and causation. In the United States, for example, there is a correlation of 99.26% between the divorce rate in the state of Maine and per capita margarine consumption. The correlation between the number of people who drown in a swimming pool and the number of films starring Nicolas Cage per year is 66.6%. Finally, there is a correlation of 95.86% between per capita consumption of mozzarella and the number of PhDs awarded in civil engineering. By that logic, being a cheese aficionado would suffice to secure one’s entry into Polytechnique…
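As a rough illustration of how such figures can arise, the sketch below uses synthetic data only; it does not reproduce Vigen's datasets, and the series names and parameters are hypothetical. Two series are generated independently, each with its own upward trend. Their raw values come out strongly correlated simply because both grow over time. Correlating their period-to-period changes instead is one common check, assumed here for illustration, and it makes most of the apparent link disappear.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def trending_series(rng, n, slope):
    """Synthetic data: a linear trend plus independent Gaussian noise."""
    return [slope * t + rng.gauss(0.0, 1.0) for t in range(n)]


def differences(xs):
    """Period-to-period changes; a linear trend becomes a constant offset."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical, independently generated series: neither influences the other.
cheese = trending_series(rng, 200, 0.5)
doctorates = trending_series(rng, 200, 0.3)

r_levels = pearson(cheese, doctorates)
r_changes = pearson(differences(cheese), differences(doctorates))

print(f"correlation of raw levels: {r_levels:.3f}")
print(f"correlation of changes:    {r_changes:.3f}")
```

The first figure is close to 1 even though the two series share nothing but the passage of time, which is the mechanism behind many of the correlations listed above.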

Big data carries with it the belief that reality can be measured through the real-time collection of signals, without really interpreting or questioning the content of the data. We submit to big data in a quasi-mystical state. Big data is suddenly an oracle. This new idol seems to give us access to the very fabric of truth. Admittedly, this approach does give us access to previously imperceptible phenomena, which explains in part why we are so inclined to rely on machines.

As such, big data is a reaffirmation of the nominalist philosophical discourse, for which nature can be understood directly, without hypothesizing or creating models of the world. Consequently, we come to govern through a statistical expression of reality that makes no reference to causality or intentionality, and that ends up erasing the existence of the individual. We move from a probabilistic logic of prevention, defined with reference to the standards of the “average human” (plague, leprosy, foreigner, weight, abnormal), to a logic of preemption that assigns a potential risk to environments and behaviors: credit default, potential terrorists, etc.

By striving to eliminate uncertainty, we eliminate differences between individuals. The legal scholar Antoinette Rouvroy claims that: “Preemptive politics consists of treating the event in question as if it has already taken place or will take place immediately, and hence imposes consequences in advance (refusal of insurance, preemptive elimination of terrorists, professional guidance, etc.)”. Ignoring causality condemns us to repeat the prejudices of the past, to pass incontestable judgement, and to accept the resignation and disappearance of public life. Politics is fundamentally the taking of decisions in the face of uncertainty. The law is not intended to solely and directly capture the world of facts.

The difference between facts and the law is precisely the difference between what is and what ought to be. When evaluating a case, lawyers first qualify the facts and then apply the rules of the applicable law. In doing so, they establish a link between observed facts and existing norms. When reality is measured without being interpreted or attributed a cause, the qualification of facts is eroded and judicial interpretation is set aside. In matters of law, data takes the place of facts. This confusion between reality and truth deprives the law of its mission to find a compromise between the conflicting interests in society.

The contemporary crisis of representation is exacerbated by the advent of big data. We are confronted with a new problem: how are we represented in the world? Will we be represented solely by our data and reduced to a quantified version of ourselves?

On this view, there is no need for experiments, research or theorizing: it suffices to collect and analyze raw data, and mathematical realities will appear by themselves. This idea has been theorized and invalidated before. The French philosopher Auguste Comte already explained that the data a society produces is not fully representative of that society, and objected that probability and statistics would never fully capture the complexity of human behavior. A data point is inherently captured in a specific context and is tied to the moment, place and methodology of its recording. Nor does a data point say everything. It is also silent: it contains a part of reality that is not expressed. It cannot be processed without certain choices, prejudices and biases. The new science of data cannot be separated from the socio-historical processes that led to its own creation.

Adrien Basdevant
