Authors: Judea Pearl, Dana Mackenzie

The Book of Why: The New Science of Cause and Effect. Judea Pearl, Dana Mackenzie. 2018

**To finally understand whether coffee is harmful** (The Book of Why)

The Book of Why: The New Science of Cause and Effect is an adaptation of Judea Pearl’s scientific publications for the general public. Why read a popular book about statistics? The reason is purely selfish (in the literal sense of the word): we regularly hear that “British scientists have established” something, that “British scientists have refuted what they established the day before”, or that “British scientists hold directly opposite points of view”. This would not matter if these “discoveries” did not concern products we use regularly (for example, face creams and serums) or our daily activities (is it better to jog in the morning or in the evening, to run or to lift weights?). To pick out the more reliable sources from the pile of information, it is necessary to understand the statistical principles of “correct” research.

There are two further reasons. First, modeling cause-and-effect relationships is extremely important for the development of artificial intelligence, so the theoretical work in this area roughly indicates the logic that the development of artificial intelligence will follow (and we all still wonder when robots will replace us). Second, the book gives an insight into the theoretical debates within statistics as a science.

Considering all this, it is impossible to say that the book is easy to read, of course. But still, once you get used to the formulas (the authors did not manage to do without them entirely), reading becomes pleasant. It should also be noted that Pearl is not very tactful about many of the scientists who have worked in statistics.

A few words about artificial intelligence: Judea Pearl is convinced that developing artificial intelligence on the basis of cause-and-effect relationships is the only right way. Its advantage over deep learning is that cause-and-effect models are transparent, while deep learning is not. So, although Google’s AlphaGo program beats professional Go players, which had seemed impossible (unlike in chess, there are too many possible positions to learn them all), the developers do not know how it works. Pearl is sure that robots should understand the subjunctive mood (counterfactual “what if” reasoning), because only this makes it possible to communicate with people and to learn from past mistakes.

**What’s wrong with regular statistics?**

Traditional statistical methods generally show correlation but not causation. This truth is hammered into the minds of all students in statistics courses. Traditional methods have revealed many patterns, but they seriously limit what we can learn about the world in the 21st century. After all, correlation not only sometimes misleads us (a rooster’s crowing at dawn is by no means the cause of the sunrise), but also cannot answer questions such as “What is the main reason for the patient’s recovery?”, “What would happen if the population drastically reduced its alcohol consumption?” or “What happens if you change the tax rate?”, along with many others for which it is impossible to run an experiment with a control group. (Such experiments have become the standard in medicine and are gradually spreading to other areas.)

The main reason for this situation is the absence of a conceptual apparatus for expressing cause-and-effect relationships. At the same time, “What if?” questions are an integral part of our thinking. In all areas of life, we are guided by an analysis of what is happening and by reflection on what will happen if we act one way or another. Imagination is the most important factor in the formation of humans and the development of society, as Yuval Noah Harari showed in his book “Sapiens: A Brief History of Humankind”.

To enrich the statistical apparatus, Judea Pearl offers diagrams with arrows, known as causal diagrams (we will discuss them in detail below). Point X and point Y are connected by an arrow whose tip shows which variable “listens” to the other: in X → Y, Y listens to X. Pearl was not the first to represent the relationship between two events graphically. The cause-and-effect revolution took place gradually over more than half a century.

The arrows seem a trivial innovation only at first glance; in fact, they require non-trivial logical abilities (do not relax). Thanks to them, the analysis has reached a new, third level. Pearl’s ladder of causation shows all three. The first rung is correlation: we just observe what is happening (yes, big data analysis and artificial intelligence are on this rung). On the second rung we think about the consequences of our actions, that is, we intervene (studies with control groups belong here). The third rung is the transition to the subjunctive mood, where answering the question “What if?” requires only data and arrows, cleverly combined with familiar statistical methods.

**Correlation is not causation. Is it?**

The English anthropologist, geographer, and psychologist Francis Galton (1822–1911) was one of the first to analyze heredity. He looked at the heights of fathers and their sons and identified a pattern known as “regression to the mean”: it is highly likely that a very tall father will have a son shorter than himself, and a very short father a son taller than himself. If this were not the case, the spread of heights in the population would keep growing from generation to generation, but it remains stable. To illustrate this process, he designed the “Galton board”. If you drop one ball, it is difficult to predict where it will land, but with a thousand balls the overall distribution takes a shape well known to statisticians, the bell curve.

Fig. 1. Galton board
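The behavior of the board can be imitated in a few lines. The sketch below is only an illustration under simple assumptions that do not come from the book: each ball bounces left or right with equal probability at every row of pegs, and the number of rows (10) and balls (10,000) is arbitrary.

```python
import random
from collections import Counter


def galton_board(n_balls, n_rows, seed=0):
    """Drop balls through rows of pegs; return how many land in each bin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    bins = Counter()
    for _ in range(n_balls):
        # At each peg the ball moves right (1) or left (0) with equal chance;
        # the final bin is simply the number of moves to the right.
        position = sum(rng.randint(0, 1) for _ in range(n_rows))
        bins[position] += 1
    return [bins[k] for k in range(n_rows + 1)]


counts = galton_board(n_balls=10_000, n_rows=10)
for k, c in enumerate(counts):
    # One '#' per 50 balls gives a rough text histogram.
    print(f"bin {k:2d}: {'#' * (c // 50)} {c}")
```

The path of a single ball is unpredictable, but the histogram of many balls piles up in the middle bins and thins out toward the edges.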

The idea of the “two-way” correlation revealed in this way (from the father’s height you can predict the son’s, and from the son’s height the father’s), which is not a causal relationship, was taken up by the English mathematician, statistician, biologist and philosopher Karl Pearson (1857–1936). Pearson saw in this approach an opportunity to bring the humanities and social sciences (for example, psychology) up to the level of the exact sciences, because a rigorous mathematical methodology had appeared. At the same time, he considered the analysis of causal relationships unnecessary and fundamentally wrong. Pearson founded the scientific journal Biometrika, still a leading journal in the field of statistics. Thanks to this (and to the active development of statistics along the lines of correlation analysis), Pearson did much to make “correlation does not reveal causation” a de facto axiom. He actively fought opponents of this point of view by all means available in the scientific community.

Despite this, the American geneticist and statistician Sewall Wright (1889–1988) relied heavily on the idea of causation when he analyzed the coat color of guinea pigs. He showed which factors had to be taken into account to predict the coat color of a guinea pig when the coat color of its ancestors is known. In doing so, he used a diagram with arrows. However, this way of representing cause-and-effect relationships did not take root at the time. Sociology developed structural equation modeling, and economics developed systems of simultaneous equations, both of which allowed causality to be taken into account.

Why was all this important to scientists? Scientific approaches are based on philosophical concepts. Pearson was an adherent of positivism and therefore believed that science should be based on objective data, facts, and figures, that is, on statistics. Building causal models of guinea pigs, by contrast, has an obvious subjective element: the scientist decides in advance which factors could have an influence and includes them in the model, even though these factors are not in the data on the animals’ coat color.

Judea Pearl, in proposing a methodology for taking cause-and-effect relationships into account, is sure that relying on facts already known to us when building models is not only acceptable but also desirable. We must use common sense. Thus, he continues the eternal debate within the scientific community about how much subjectivity is acceptable in science.

**Nowhere without Holmes**

As you remember, Sherlock Holmes pieced events together from clues and scraps of information and found the cause of what had happened, discarding impossible and less probable explanations. How to do this in mathematical language was formulated by Thomas Bayes (1702–1761). Thanks to Judea Pearl, Bayesian networks became widespread in the 1980s and are now used in artificial intelligence. For example, they underlie the DNA identification of victims of tragedies, even if the DNA of only distant relatives is known.

Bayes’ formula helps to understand the real probability that a diagnosis is correct. For example, in breast cancer screening, false-positive results are quite common. The share of women who actually have cancer among all those examined, together with the shares of sick and healthy women who receive a “positive” result, are substituted into the formula. The resulting probability that a woman with a positive result actually has cancer is less than one percent (these are averaged data; heredity, age, etc. should also be taken into account).
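The calculation can be sketched as follows. The three input numbers are assumptions used only for illustration (they are not given in the text above): a prevalence of 1 in 700, a test sensitivity of 73%, and a false-positive rate of 12%.

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    """Bayes' rule: P(disease | positive test)."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Assumed illustrative inputs, not taken from the text.
p = posterior_positive(prevalence=1 / 700, sensitivity=0.73, false_positive_rate=0.12)
print(f"P(cancer | positive test) = {p:.2%}")
```

Under these assumed inputs the result is a little under one percent: because the disease is rare, the false positives among the many healthy women far outnumber the true positives among the few who are ill.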

Models based on Bayes’ formula are suitable if A ⇒ B ⇒ C (a chain). However, it is not uncommon for causal relationships to fit the scheme A ⇐ B ⇒ C (a fork) or the scheme A ⇒ B ⇐ C (a collider).

**What to do with distortion**

In statistics, there is such a thing as a confounding (distorting) factor. For example, if we want to find out how walking (X) affects life expectancy (Y), we should not forget that age (Z) affects both the intensity of walking and the remaining life expectancy (an 80-year-old doesn’t walk fast and probably won’t live as many more years as a 20-year-old).

Therefore, the Z factor is “controlled” in the calculations. One way is a randomized controlled trial (first tried in 1923–1924 in agriculture, when a field was divided into squares and fertilizers were assigned to them in random order). But since it can be difficult to distinguish factors that merely correlate with each other from those that really have an influence, that is, “do” something, it sometimes happens that scientists control the wrong factors (or even the very factors whose influence they want to analyze).
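What “controlling” for age means in the walking example can be sketched with a small calculation. All probabilities below are hypothetical and invented purely for illustration: age is reduced to two groups, and walking is assumed to add 5 percentage points to survival within each age group.

```python
# Hypothetical numbers, invented for illustration only.
P_AGE = {"young": 0.5, "old": 0.5}            # P(Z)
P_WALK_GIVEN_AGE = {"young": 0.8, "old": 0.2}  # P(X = walks a lot | Z)
P_SURVIVE = {                                  # P(Y = survives | Z, X)
    ("young", True): 0.95, ("young", False): 0.90,
    ("old", True): 0.55, ("old", False): 0.50,
}


def naive_survival(walks):
    """P(Y | X): compares walkers and non-walkers without looking at age."""
    numerator = denominator = 0.0
    for age, p_age in P_AGE.items():
        p_x = P_WALK_GIVEN_AGE[age] if walks else 1 - P_WALK_GIVEN_AGE[age]
        numerator += p_age * p_x * P_SURVIVE[(age, walks)]
        denominator += p_age * p_x
    return numerator / denominator


def adjusted_survival(walks):
    """Age-adjusted survival: average the within-age rates over P(Z)."""
    return sum(p_age * P_SURVIVE[(age, walks)] for age, p_age in P_AGE.items())


naive_effect = naive_survival(True) - naive_survival(False)
adjusted_effect = adjusted_survival(True) - adjusted_survival(False)
print(f"naive difference:    {naive_effect:.2f}")
print(f"adjusted difference: {adjusted_effect:.2f}")
```

In this toy setup the naive comparison greatly overstates the benefit of walking, because walkers are mostly young; averaging within age groups recovers the 5-point effect that was built into the numbers.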

There are different manifestations of the Z-factor.

Z looks like a confounding factor, but it is not. In this case, Z is a mediator, that is, this factor only explains how X affects Y (no need to control).

In this case, Z is a proxy of the mediator M (no need to control).

In this case, no variable needs to be controlled when analyzing the influence of X on Y (there is not a single factor that would simultaneously affect X and Y and therefore would not allow one to establish a net influence of the first on the second).

In this case, it is necessary to control B; if this is not possible, the only option is a randomized controlled trial.

In this case, none of the factors need to be controlled (although it is not uncommon to try to control B, this is called M-bias).

In this case, you need to control the variable C.

**In tobacco smoke, you can’t see a single thing**

In the first half of the 20th century, the proportion of smokers increased sharply. Smoking became fashionable, and the industrial production of cigarettes made it possible to smoke more of them, since the smoker no longer spent time rolling them. Tobacco companies ran aggressive advertising campaigns.

Today it is scientifically proven that smoking causes lung cancer, but it took years to prove. The first studies on the dangers of smoking appeared in the late 1940s. However, opponents heavily criticized them on two points: (a) the studies were retrospective (that is, they asked “How actively did you smoke?”, and respondents were likely to answer inaccurately); (b) it was suggested that a special gene might exist that makes some people more prone to cancer when they smoke, or that leads them to smoke more.

Given the ethical side of the issue, randomized experiments with control groups were not possible. Therefore, longitudinal studies were launched, which after five years showed that smokers were much more likely to develop lung cancer.

During the debate about the dangers of tobacco in the 1960s, criteria were formulated for when an observed correlation can be taken as evidence of a causal relationship (at that time, classical statistics recognized only correlation and refused to see the causal relationships sometimes hiding behind it). These are the so-called Hill’s criteria; not all of them have to be met. Initially there were five, and later several more were added:

• consistency: many studies conducted in different settings show the same result;

• strength: the association between the factor and the effect must be strong;

• specificity: one specific factor causes a specific effect;

• temporal relationship: the effect always follows the cause;

• coherence: the revealed regularity does not contradict other knowledge in this area obtained in other studies.

As a result, since the 1970s, an active policy has been pursued to reduce the proportion of smokers in developed countries (banning advertising on TV, etc.).

By the way, decades later, researchers found that some people do have a gene that makes cancer cells develop more actively in smokers, but its influence is so small that it cannot explain the sharp increase in lung cancer in the first half of the 20th century.

Back in the 1960s, a pattern was found: low-birth-weight babies of smoking mothers were more likely to survive than low-birth-weight babies of non-smoking mothers. Is smoking good for you?

No. In fact, “newborn weight” was misused as a controlled factor. Low weight is a common effect that can indicate either that (a) the mother smoked or (b) the child has other serious illnesses. In the latter case mortality was higher, while the proportion of mothers who smoked was small.

**Several paradoxes**

Graphical diagrams are designed to help in situations where it is difficult to calculate the probability of events offhand, especially when there are distractions or when new information appears that a person forgets to take into account when adjusting the calculation.

**The Monty Hall paradox.** It illustrates exactly this last situation. On the American TV show Let’s Make a Deal, the contestant stood in front of three closed doors. Behind one was a car, and behind each of the other two, a goat. The first move belonged to the contestant, who chose one of the doors (the door was not opened). On the second move, the host opened one of the other two doors, one with no car behind it. On the third move, the contestant chose between two options: open the door chosen in the first step or the one that the host did not open.

The best strategy is to switch doors in the third step. In the first step the probability of picking the car was 1/3, but after the additional information obtained in the second step the probability must be recalculated: staying with the original door still wins with probability 1/3, while switching wins with probability 2/3. (Switching would bring no advantage only if the host opened one of the other doors at random, but he always opened one with no car behind it.)

If the player chooses the first door in the first step, consider all possible cases:
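The cases can be enumerated directly. The sketch below assumes, as in the description above, that the car is equally likely to be behind any door and that the host always opens a door with a goat; the function name and table layout are illustrative choices only.

```python
from fractions import Fraction

DOORS = (1, 2, 3)


def enumerate_cases(first_choice=1):
    """List (car, door opened by host, stay wins, switch wins) for each car position."""
    rows = []
    for car in DOORS:
        # The host may only open a door that is neither the player's nor the car's.
        host_options = [d for d in DOORS if d != first_choice and d != car]
        opened = host_options[0]  # with two goat doors, either choice gives the same outcome
        switch_to = next(d for d in DOORS if d not in (first_choice, opened))
        rows.append((car, opened, first_choice == car, switch_to == car))
    return rows


rows = enumerate_cases(first_choice=1)
print("car  host opens  stay wins  switch wins")
for car, opened, stay, switch in rows:
    print(f"{car:>3}  {opened:>10}  {str(stay):>9}  {str(switch):>11}")

# Each car position is equally likely, so count the winning rows.
p_stay = Fraction(sum(r[2] for r in rows), len(rows))
p_switch = Fraction(sum(r[3] for r in rows), len(rows))
print(f"P(win | stay) = {p_stay}, P(win | switch) = {p_switch}")
```

Staying wins only in the one case where the first choice was already right; switching wins in the other two.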

**Berkson’s paradox.** It is observed when two events that are independent of each other appear to be related once we condition on a third event that depends on both. For example, to men who actively invite the women they are interested in on dates, it may seem that beautiful women are especially unintelligent. But this is not so: they simply do not invite women they find unattractive, so the women they meet are a selected group rather than a cross-section. This paradox is seen most often in medical research, where, for example, two rare diseases are positively correlated among hospitalized patients, although this pattern is not observed in the general population.
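A small simulation shows how selection alone can create such an association. It is a sketch under assumptions that are not in the text: two traits are drawn independently and uniformly between 0 and 1, and a person is “selected” when the sum of the two traits exceeds 1. With this particular selection rule the induced correlation is negative.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable
# Two independent traits per person (hypothetical scores between 0 and 1).
people = [(rng.random(), rng.random()) for _ in range(20_000)]
# Assumed selection rule: only people whose combined score is high are observed.
selected = [(a, b) for a, b in people if a + b > 1.0]

corr_all = pearson(*zip(*people))
corr_selected = pearson(*zip(*selected))
print(f"whole population: {corr_all:+.2f}")
print(f"selected group:   {corr_selected:+.2f}")
```

In the whole population the traits are unrelated, but inside the selected group a low score on one trait has to be compensated by a high score on the other, which produces a correlation that does not exist outside the selection.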

**Simpson’s paradox.** It happens that two groups of data each show the same dependence (the drug does not help), but when they are combined, the dependence is the opposite (the drug helps). For example, 5% of women in the control group had a heart attack, compared with 7.5% of women taking the drug. For men the situation is similar: 30% in the control group and 40% in the drug group. But when the two sets are pooled, it turns out that 22% of the control group versus 18% of the drug group had a heart attack. This happens because gender is a confounding factor here that the pooled comparison ignores: heart attacks are more common among men, and men and women are unevenly distributed between the two groups.
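The arithmetic behind the reversal can be checked directly. The counts below are hypothetical: they are chosen only so that they reproduce the percentages quoted above, and the text itself does not state the group sizes.

```python
# Hypothetical counts (heart attacks, group size), chosen to match the percentages.
data = {
    "women": {"control": (1, 20), "drug": (3, 40)},
    "men": {"control": (12, 40), "drug": (8, 20)},
}


def rate(attacks, total):
    """Share of the group that had a heart attack."""
    return attacks / total


def pooled(arm):
    """Heart-attack rate for one arm with women and men lumped together."""
    attacks = sum(data[group][arm][0] for group in data)
    total = sum(data[group][arm][1] for group in data)
    return rate(attacks, total)


for group, arms in data.items():
    print(group, {arm: f"{rate(*counts):.1%}" for arm, counts in arms.items()})
print("pooled", {arm: f"{pooled(arm):.1%}" for arm in ("control", "drug")})
```

With these counts most women are in the drug group and most men in the control group, so the pooled control group is dominated by the higher-risk men. That imbalance, not the drug, is what makes the pooled drug group look better.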

**What to do if not everything is known**

Everything described above was essentially about the first rung of the ladder of causation. Now we will talk about the second rung, where the probability of Y when we act, P(Y | do(X)), can be compared with the probability of Y when we merely observe X, P(Y | X); in other words, an action is brought into the analysis.

More about smoking

As already mentioned, a gene that promotes cancer in smokers does exist, and one might wonder how to determine the effect of smoking on the development of cancer if there is no way to measure the influence of this gene.

In this case, the effect of smoking on the accumulation of tar in the lungs is analyzed, using data on smokers and a control group of non-smokers. The next step is to estimate the likelihood of cancer for a given amount of tar in the lungs. Thus, an intermediate “tar” variable is introduced, which is affected by smoking but on which the gene cannot have a direct and significant influence.
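This two-step reasoning is usually called front-door adjustment. The sketch below checks it on a toy model; every probability in it is hypothetical and invented for illustration. The gene is used only to generate the data and to compute the true answer, while the front-door estimate uses only what an analyst could observe: smoking, tar, and cancer.

```python
from itertools import product

# Hypothetical model, invented for illustration only.
P_GENE = 0.5                        # P(gene = 1), hidden from the analyst
P_SMOKE = {1: 0.8, 0: 0.2}          # P(smoke = 1 | gene)
P_TAR = {1: 0.9, 0: 0.1}            # P(tar = 1 | smoke)
P_CANCER = {(0, 0): 0.05, (1, 0): 0.35, (0, 1): 0.25, (1, 1): 0.55}  # P(cancer = 1 | tar, gene)


def bern(p, value):
    """Probability that a binary variable with P(1) = p takes `value`."""
    return p if value == 1 else 1 - p


# Observable joint distribution P(smoke, tar, cancer): the gene is summed out.
observed = {}
for g, x, z, y in product((0, 1), repeat=4):
    p = bern(P_GENE, g) * bern(P_SMOKE[g], x) * bern(P_TAR[x], z) * bern(P_CANCER[(z, g)], y)
    observed[(x, z, y)] = observed.get((x, z, y), 0.0) + p

NAMES = ("x", "z", "y")  # smoking, tar, cancer


def prob(**fixed):
    """Probability of the given values, e.g. prob(x=1, y=1), from observed data only."""
    return sum(p for key, p in observed.items()
               if all(key[NAMES.index(name)] == value for name, value in fixed.items()))


def front_door(x):
    """P(cancer | do(smoke = x)) estimated without measuring the gene."""
    total = 0.0
    for z in (0, 1):
        p_tar_given_smoke = prob(x=x, z=z) / prob(x=x)
        # Effect of tar on cancer, averaged over smoking status.
        p_cancer_given_tar = sum(
            prob(x=x2, z=z, y=1) / prob(x=x2, z=z) * prob(x=x2) for x2 in (0, 1)
        )
        total += p_tar_given_smoke * p_cancer_given_tar
    return total


def true_do(x):
    """Ground truth computed from the full model, including the hidden gene."""
    return sum(
        bern(P_TAR[x], z) * sum(bern(P_GENE, g) * P_CANCER[(z, g)] for g in (0, 1))
        for z in (0, 1)
    )


def naive(x):
    """Plain observed P(cancer | smoke = x), distorted by the gene."""
    return prob(x=x, y=1) / prob(x=x)


for x in (0, 1):
    print(f"smoke={x}: naive={naive(x):.2f} front-door={front_door(x):.2f} truth={true_do(x):.2f}")
```

In this toy model the naive comparison overstates the effect of smoking, while the estimate that goes through tar matches the truth even though the gene is never measured.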

How do you get cholera?

In 1854 there was a cholera outbreak around Broad Street in London. At the time, doctors did not know how infection occurred; it was widely believed to be airborne. Dr. John Snow was able to establish that cholera spreads through water. He analyzed all the cases of infection and found several in which people living in other areas had become infected although they rarely visited Broad Street and came there only for water. At the same time, not all houses on the street had cases of infection.

So, graphically speaking, Snow introduced an external variable that affected water quality: the water supply company. It turned out that two companies served the street. One drew water from the Thames upstream of London, the other downstream. It was the water of the latter that carried the infection.

**How to deal with the subjunctive**

Philosophers and scientists in various fields have often wondered what to do with reasoning about the hypothetical: how to write it down and whether that can be done at all (in the case of statistics), and whether such reasoning implies that hypothetical possibilities really exist somewhere (since we can imagine them). But for Judea Pearl, this is not so important. What matters is that people constantly operate with such ideas and base their actions on them (whether in questions of ethics or everyday purchases). Therefore, the main question is how best to write down this way of thinking in a schematic form suitable for modeling and for artificial intelligence.

In statistics, methods have been developed for filling in table cells that contain question marks, that is, missing values. For example: how much would Alice earn if she had graduated from university? The first way is to find a perfect match for Alice in the data, the second (if there is no perfect match) is to find an approximate match, and the third is linear regression.

A linear regression would work something like this: the starting point is the salary of a person with no work experience and no education ($65,000); the data then show that each year of work experience raises the salary by $2,500 and that a university degree adds $5,000. As a result, we would conclude that with a university degree, Alice would earn $85,000.

However, this regression does not take into account the fact that the length of education affects the length of work experience. If this were taken into account, the answer would be $76,000. Of course, this is also a probabilistic value, but it seems closer to reality than the estimate that ignores the effect of education on work experience.
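The two estimates can be reproduced with a short calculation. The salary equation uses the numbers from the text. Everything else is an assumption chosen so that the sketch arrives at the same two figures: that Alice currently has 6 years of experience, no degree and an actual salary of $81,000, and that a degree costs 4 years of work experience.

```python
# Salary equation from the text (dollars): base + per year of experience + degree bonus.
BASE, PER_YEAR, DEGREE_BONUS = 65_000, 2_500, 5_000


def salary(experience, degree, personal=0):
    """Salary predicted by the equation, plus an optional individual deviation."""
    return BASE + PER_YEAR * experience + DEGREE_BONUS * degree + personal


def experience_given(degree, personal=0):
    """Assumed (hypothetical) rule: a degree costs 4 years of work experience."""
    return 10 - 4 * degree + personal


# Assumed observations about Alice, not stated in the text.
alice_experience, alice_degree, alice_salary = 6, 0, 81_000

# Plain regression: change the degree, keep her experience as it is.
naive = salary(alice_experience, degree=1)

# Structural reasoning, step 1: recover what is individual about Alice.
u_salary = alice_salary - salary(alice_experience, alice_degree)
u_experience = alice_experience - experience_given(alice_degree)
# Step 2: give her the degree and let experience respond to it.
cf_experience = experience_given(degree=1, personal=u_experience)
# Step 3: predict her salary in that hypothetical world.
counterfactual = salary(cf_experience, degree=1, personal=u_salary)

print(f"regression estimate:     ${naive:,}")
print(f"counterfactual estimate: ${counterfactual:,}")
```

The gap between the two numbers comes entirely from the years of experience Alice would have given up while studying, which the plain regression silently keeps.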

**What if the factor has an indirect effect?**

Quite often there is a debate about what matters more for a child’s high IQ: the parents’ IQ or their social position. For this reason, the “social status” variable is often controlled. At the same time, it is obvious that “social status” is a mediator: alongside its direct impact, the “parents’ IQ” variable also has an indirect impact that passes through it (X on the graph).

Another example shows how effectiveness was evaluated without a clear understanding of cause-and-effect relationships and influencing factors. In the 1990s, Chicago schools, which were seriously lagging behind the American average, began the Algebra for All program: all ninth graders were required to take the full math course required for college admission.

A simple analysis of the performance of Chicago schoolchildren (comparing the graduating classes “before” and “after” the start of the program) showed that the program was successful. However, as every teacher knows, it is very difficult to maintain a high level of academic performance if the class contains children of different levels and with different interest in the subject. The classroom environment has an effect. When the researchers took this factor into account, it turned out that the program’s positive effect was less clear-cut, and that the increase in grades was explained only by changes in teaching methods in the earlier grades (a reform there had not been taken into account by the first researchers). When these results became apparent, the Algebra for All program was reformed: students who were lagging behind had to attend twice as many classes as those who performed well in the subject.

**10 best ideas on one page**

1. Traditional statistical methods generally show correlation but not causation. Statistics has no conceptual apparatus for expressing cause-and-effect relationships.

2. Correlation sometimes misleads us, and it cannot answer questions for which an experiment with a control group is impossible (for example, why did the patient recover?).

3. To enrich the statistical apparatus, Judea Pearl proposes to represent the relationship of events graphically (using diagrams with arrows, where the tip shows which variable “listens” to the other).

4. The desire of traditional statistics to analyze data without taking into account the life experience and knowledge of the analyst is fundamentally wrong. The technique proposed by Judea Pearl solves this problem. Common sense is the basis of any analysis.

5. The history of the fight against smoking shows that separating scientific practice from common sense and reality can be dangerous for society: an active public policy could have started several years earlier and saved lives.

6. Common sense is necessary, but it is not enough. A clear analysis algorithm is needed, since it is easy to mislead a person when calculating probability and other things, as paradoxes (Monty Hall, Berkson, Simpson) show.

7. The development of statistics in the direction proposed by Judea Pearl will make it possible to respond better to medical diagnoses and to choose the further treatment protocol (as the example with the accuracy of breast cancer diagnoses shows).

8. Common sense is necessary because only logical reasoning can tell whether a factor distorts the results or not.

9. Judea Pearl partly explains the errors identified in traditional statistics by the fact that scientists relied on the philosophy of positivism, according to which science should be based on objective facts and figures.

10. Judea Pearl is confident that the development of statistics in the proposed direction will allow artificial intelligence to reach a new level, since until now it has been based mainly on traditional data analysis, which differs significantly from the way humans think.