Data also lies
How many times have we heard “the data says so!”, “research shows that…”, or “according to a study…”? Data has become synonymous with truth. Today it is easier than ever to access almost any source of data, from NASA datasets to city council reports. Yet, like any other source of information, research and data are subject to biases in the way they are conducted or collected, analyzed, and interpreted. Data is not neutral or objective, even though we have been told otherwise.
Data visualizations (graphs, maps, tables) are increasingly common in the media, where they accompany topics such as unemployment, budgets, or CPI (consumer price index). They are also common in political debates, where politicians support their arguments by showing graphs that are often misleading and can even be outright lies. Today, graphs and statistics are included in any argument to add “veracity”, in other words, “truth”.
Data visualizations are not neutral or objective; instead, they are cultural constructs: data must be simplified and distorted in order to represent information through shapes, positions, and colors (Mazón, 2019); otherwise, it would be unintelligible. Naturally, the way data is represented directly influences how it is interpreted.
During a political debate in 2019, the Spanish politician Pablo Casado used a graph on permanent contracts (left) that is far from objective. The data is correct, but the way it is visualized is not. The graph shows the evolution of the year-on-year difference, not the number of contracts in absolute values. According to Maldita.es, “using year-on-year rates in this type of graph is not usually representative. In addition, this graph does not show the values of the vertical axis and, therefore, it is normal to think that these are absolute values”. Maldita.es created their own graph (right) showing the absolute data and, therefore, the total number of permanent contracts. This is just one example of the techniques used to make data lie.
When does data lie?
1. Axes are cut or not shown
One of the most common practices when manipulating a graph is to cut the vertical axes so that they do not start at 0. This makes the bars or lines show more or less exaggerated differences using exactly the same data. It is a technique to increase distances and polarize opinion.
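The distorting effect of a cut axis is easy to quantify. The sketch below, written in Python with hypothetical poll figures, compares the ratio of drawn bar heights when the axis starts at 0 versus when it is cut just below the data:

```python
def visual_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn heights of two bars when the y-axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)

# Two hypothetical poll results, 52% vs 50% -- almost identical values.
honest = visual_ratio(52, 50)          # axis starts at 0: bar A is 1.04x bar B
truncated = visual_ratio(52, 50, 48)   # axis cut at 48: bar A looks 2x bar B

print(f"full axis: {honest:.2f}x, truncated axis: {truncated:.2f}x")
```

Exactly the same data, but cutting the axis at 48 makes a 2-point gap look like one bar is twice as tall as the other.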
This practice is very common in the mass media. In 2015, the Spanish TV program, Espejo Público, showed a graph of poll results in which, in addition to cutting the axes, the length of the bars does not match the percentages. The graph on the right below is the original and shows that the difference between Pedro Sánchez and Albert Rivera is actually smaller than what was shown on the TV program.
This tweet by TVE (the Spanish state-owned public TV channel) shows a graph that is also wrong. If we look at the highlighted values, the 2019 figure of 4.1 million is drawn above the 2014 value of 4.4 million, even though it is smaller. At first glance, it is difficult to detect these errors, since the vertical axis has been deliberately omitted and there is no reference other than the figures at each point.
2. Not enough data is shown
Whenever data is represented, decisions must always be made as to what is included and what is not. In fact, this is a natural process of any visualization since it is an act of synthesizing information to make it understandable. However, there are many cases in which information is intentionally excluded in order to manipulate the perception of the data.
Thus, if we want to convince an audience that a company is doing well and that sales are increasing annually, we could show a graph that only includes the years when sales have risen.
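With made-up annual sales figures, this kind of cherry-picking is a one-liner:

```python
# Hypothetical annual sales figures (illustrative, not real data).
sales = {2015: 120, 2016: 135, 2017: 110, 2018: 125, 2019: 98, 2020: 140}

years = sorted(sales)
# Keep only the years in which sales rose relative to the previous year.
cherry_picked = [y for prev, y in zip(years, years[1:]) if sales[y] > sales[prev]]

print(cherry_picked)  # only the "good" years survive the cut
```

A chart built from `cherry_picked` would tell a story of uninterrupted growth, even though sales fell in two of the five year-on-year changes.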
In practice, it is easy to find graphs every day that do not include enough information, especially on TV or in videos where the graph is only shown for a few seconds. TVE showed a graph on the evolution of unemployment (left) that only includes data from 2000, 2008, and 2013. The gradual evolution cannot be seen, only values that exaggerate the difference between the bars. On the right is the same graph containing all the information, year by year.
3. Misleading percentages
You have probably heard statements such as “80% of dentists recommend this toothpaste”, or “9 out of 10 people recommend it”. An extremely important consideration when reading a percentage is the size of the sample. How many dentists were asked? How many people were polled? Asking ten dentists is not the same as asking a thousand.
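Sample size matters because the uncertainty of a poll shrinks only with the square root of the number of respondents. As a rough sketch, assuming a simple random sample and the usual normal approximation for a 95% confidence level:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.8  # "80% of dentists recommend this toothpaste"
print(f"n=10:   +/- {margin_of_error(p, 10):.1%}")    # a huge margin of error
print(f"n=1000: +/- {margin_of_error(p, 1000):.1%}")  # roughly ten times smaller
```

With only 10 dentists, the true figure could plausibly be anywhere from the mid-50s to 100%; with 1,000 respondents, the estimate is pinned down to within a few points.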
It is also crucial to consider the conditions of the sample. One of the issues that has caused the most confusion since the Covid-19 vaccination campaign began is the percentage of vaccinated people among those hospitalized with the virus: “Who says that the vaccine is effective?” Beware of this type of statement. Obviously, when the vaccination rate is very high, most Covid-19 hospitalizations will be among vaccinated people, simply because most of the population is vaccinated.
In Spain, for every 10 people hospitalized, 7 are vaccinated with the complete regimen and 3 have not received the vaccine.
Although more vaccinated people than unvaccinated people are hospitalized in absolute terms, the rate of hospital admission per person is vastly higher among the unvaccinated.
The mathematician Javier Álvarez Liébana, who holds a PhD in the field, debunks this kind of claim on social media. He explains: “There will come a time when the vast majority of those hospitalized for Covid-19 will be vaccinated people, just as most of those hospitalized after car accidents were wearing seat belts. Claiming that neither seat belts nor vaccines work is a statistical fallacy called survivorship bias.” We should remember that, without vaccines and seat belts, the number of hospitalizations and deaths would be much higher.
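We can check this intuition with a small base-rate calculation using the 7-out-of-10 figure above. The vaccination coverage below (90%) is an assumption chosen for illustration, not a figure from the text:

```python
# From the text: 7 of every 10 Covid-19 hospital patients are vaccinated, 3 are not.
vaccinated_share_hospital = 0.7
unvaccinated_share_hospital = 0.3

# Assumed, for illustration only: 90% of the population is vaccinated.
vaccinated_share_population = 0.9

# Hospitalization risk per person in each group (up to a common constant factor):
risk_vaccinated = vaccinated_share_hospital / vaccinated_share_population
risk_unvaccinated = unvaccinated_share_hospital / (1 - vaccinated_share_population)

relative_risk = risk_unvaccinated / risk_vaccinated
print(f"An unvaccinated person is {relative_risk:.1f}x more likely to be hospitalized")
```

Under these assumptions, the unvaccinated minority is almost four times more likely, per person, to end up in hospital, even though it supplies fewer patients in absolute numbers.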
4. The data suggests something that is not correct
The case of Pablo Casado’s graph during the political debate is also a good example of data suggesting something incorrect. The vertical axis is suppressed, and if we do not know the parameters, we do not really know what we are looking at, which leads the audience to incorrect conclusions.
Presenting two related things as cause and effect often leads to erroneous messages. For example, some might say that more people are vaccinated in cities with a higher average income. The higher vaccination rate may well be due to the city’s relative wealth; however, it could also be because life expectancy is higher and, therefore, there are more older people, who were vaccinated earlier. This error, called a “spurious correlation” or false correlation, occurs when two or more events are thought to be (or presented as) related when, in fact, the link is mere coincidence or the result of a third, unseen factor.
Tyler Vigen’s website, Spurious correlations, demonstrates the incredible coincidences between two variables that are clearly not cause and effect. In the example below, Vigen shows a correlation between margarine consumption and the divorce rate in the state of Maine in the United States.
Here is another false correlation between the number of people who have drowned after falling into a swimming pool and the films in which Nicolas Cage has appeared. The coincidence is high, but there is clearly no cause-and-effect relationship between the actor’s appearances and these deaths. Let’s remember this mantra: correlation is not causation.
5. Collection of erroneous information
Whether intentionally or not, there are often errors in the collection or processing of data. In some cases, data from different official sources does not match. In other cases, the day of data collection affects its interpretation, as with new Covid-19 positives: cases are always higher on Mondays, since the data is not updated over the weekend, so new weekend cases end up being added to Monday’s count.
There are even times when data is counted twice. This is what happened in this article, which had to be corrected because the doses of Covid-19 vaccines to be distributed in Europe were counted twice: once for the European Union as a whole, and once for each European country. These errors may be totally unintentional, but in some cases they may be premeditated, with the aim of confusing the public.
How can we combat misinformation with data?
Remember, before drawing any conclusions from a graph or other statistical data, we must take a good look at all the information that appears and ask ourselves:
- Who collected it?
- What is the context?
- What is the source?
- What is shown, and above all, has anything been left out?
References
- Mazón, Pablo Rey (2019): Mentir con datos: manipulaciones y mentiras blancas [Lying with data: manipulations and white lies]
- National Geographic (2018): Why your mental map of the world is (probably) wrong (https://www.nationalgeographic.com/culture/article/all-over-the-map-mental-mapping-misconceptions)
- Cairo, Alberto (2019): How Charts Lie: Getting Smarter About Visual Information (https://www.youtube.com/watch?v=Low28hx4wyk)
- Prieto, Gonzalo: ¿Mienten los mapas electorales?, ¿votan las personas o los territorios? [Do electoral maps lie? Do people or territories vote?], Geografía Infinita (https://www.geografiainfinita.com/2020/11/mienten-los-mapas-electorales-votan-las-personas-o-la-tierra/)