Causal and non-causal relationships
From data you can always make predictions, but unless you collect experimental data, you generally cannot establish that one event actually brings about another. This is called a causal relationship. For example, a scientist in a dairy factory who tries four different packaging materials for blocks of cheese and measures the outcome for each is collecting experimental data: because the materials are assigned deliberately, differences in the outcome can be attributed to the packaging. Causal inference is the science (sometimes art?) of inferring the presence and magnitude of cause–effect relationships from data.
This conflation of correlation and causation is what we will talk about in this video. First, let's consider some examples. Fido barks when his tail wags. People with higher grades in college have higher grades in high school. People who take vitamin C recover more quickly from a cold. In each case, there's a correspondence between the events.
Take the first example: the dog, Fido, barks when his tail wags, but there's no reason to suspect that there's a causal relation between these events.
While these events often occur together, there are many times when Fido's tail wags and he doesn't bark, and times when Fido barks but doesn't wag his tail. Furthermore, we may suspect that there is some common cause for both events, like Fido's excitement when his owner comes home. Now that we can agree that these are cases of correlation without causation, we can discuss two types of correlation: positive and negative.
In the next video we'll discuss how these types of correlations specifically relate to different types of causation.
But for now, let's just introduce them. When events frequently occur together, like in the examples above, they are positively correlated. If two events are positively correlated, then when one event is present, the other is often present as well.
For example, it being a sunny day in Arizona is positively correlated with Andy succeeding on his math test.
On the other hand, two events are negatively correlated when it's likely that when one event occurs, the other will not. For instance, when it snows, it's often not very sunny, so snowing and sunniness are negatively correlated.
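As a quick numerical illustration (the measurements below are made up for this sketch, not data from the video), the sign of the Pearson correlation coefficient captures exactly this positive/negative distinction:

```python
import numpy as np

# Hypothetical daily measurements (illustrative numbers only).
sunshine_hours = np.array([10.0, 11.0, 9.0, 12.0, 8.0, 11.5])
snowfall_cm    = np.array([0.0,  0.0,  3.0,  0.0,  6.0,  0.5])
test_scores    = np.array([78.0, 82.0, 70.0, 90.0, 65.0, 85.0])

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the
# pairwise coefficient between the two inputs.
r_sun_score = np.corrcoef(sunshine_hours, test_scores)[0, 1]
r_sun_snow  = np.corrcoef(sunshine_hours, snowfall_cm)[0, 1]

print(f"sunshine vs. scores:   r = {r_sun_score:+.2f}")  # positive correlation
print(f"sunshine vs. snowfall: r = {r_sun_snow:+.2f}")   # negative correlation
```

A coefficient near +1 means the two quantities rise and fall together; one near -1 means one tends to be high when the other is low.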
We often hear about positive and negative correlations, especially in the news. For example, taking vitamin C is positively correlated with recovering more quickly from the common cold.
Or headlines like "eating more nuts makes you less likely to have higher levels of bad cholesterol" indicate that eating more nuts is negatively correlated with having higher levels of bad cholesterol. You may have heard headlines like these, had conversations with friends about them, and heard someone say something like, "Awesome, so I'll just eat more nuts and get rid of my bad cholesterol."
But unless you have evidence that a causal relation holds, it is a mistake to treat this correlation as a causal relation. So it would be wrong to say that eating more nuts will cause you to have lower levels of bad cholesterol.
So let's consider an example where two events are positively correlated but neither causes the other. Consider this again: people with higher grades in college have higher grades in high school.
Here, earning higher grades in college is positively correlated with earning higher grades in high school. Now, as we've discussed, it's incorrect to claim that earning high grades in high school always causes someone to earn high grades in college. Nonetheless, earning high grades in high school may sometimes cause a person to earn high grades in college.
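A small simulation (the variable names, coefficients, and sample are my own assumptions, not real student data) makes the common-cause pattern concrete: one latent "aptitude" factor drives both sets of grades, so they correlate strongly even though neither appears in the equation for the other:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Latent common cause: each student's aptitude (never observed directly).
aptitude = rng.normal(0.0, 1.0, n)

# Both grade variables depend on aptitude plus independent noise;
# neither grade variable causes the other.
hs_grades      = 70 + 10 * aptitude + rng.normal(0, 5, n)
college_grades = 68 + 9  * aptitude + rng.normal(0, 5, n)

r = np.corrcoef(hs_grades, college_grades)[0, 1]
print(f"correlation between HS and college grades: {r:.2f}")
```

The correlation comes out strongly positive purely because of the shared cause, which is exactly the Fido's-excitement pattern from earlier.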
From correlation alone, we can never conclude that an individual cause–effect pair exists. And there are multiple reasons you might have to work with observational data instead of experimental data to establish causality. The first is the cost involved in running these experiments. For instance, suppose your hypothesis is that giving a free iPhone to customers will produce an incremental gain in Mac sales. Running this experiment without knowing anything about the causal relationship can be an expensive proposition.
The second is that not all experiments are ethically permissible.
For instance, if you want to know whether smoking contributes to stress, you would need to make non-smokers smoke, which is not ethically possible. In that case, how do we establish causality using observational data? A good amount of research has been done on this particular issue.
The entire objective of these methodologies is to eliminate the effect of any unobserved variable. In this section, I will introduce you to some of these well-known techniques.

Panel model: This method comes in very handy if the unobserved dimension is invariant along at least one dimension. For instance, if the unobserved factor is invariant over time, we can try building a panel model, which can segregate out the bias that an ordinary regression would absorb from the unobserved dimension.
Take the classic example of the effect of college education on salary. For individual i in time period t, an ordinary regression would look like:

    salary_it = beta * education_it + ability_i + error_it

where ability_i is the unobserved factor. But, because the unobserved dimension is invariant over time, we can simplify the equation: differencing over time eliminates the unobserved factor:

    salary_i2 - salary_i1 = beta * (education_i2 - education_i1) + (error_i2 - error_i1)

Now it becomes possible to find the actual coefficient of the causal relationship between college and salary.
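A minimal simulation of the differencing idea (all variable names, coefficients, and the time-invariant "ability" confounder below are my own assumptions for illustration): a naive single-period regression is biased by the unobserved factor, while regressing changes on changes recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 10_000, 2.0          # true causal effect of education on salary

# Time-invariant unobserved factor (ability) that raises BOTH education
# and salary, biasing a naive cross-sectional regression upward.
ability = rng.normal(0, 1, n)

# Two time periods per individual.
edu1 = 12 + 1.5 * ability + rng.normal(0, 1, n)
edu2 = edu1 + rng.uniform(0, 2, n)            # some people study more later
sal1 = beta * edu1 + 4.0 * ability + rng.normal(0, 1, n)
sal2 = beta * edu2 + 4.0 * ability + rng.normal(0, 1, n)

# Naive OLS slope on a single period (biased by ability).
naive = np.polyfit(edu1, sal1, 1)[0]

# First-difference estimator: ability cancels out of (sal2 - sal1).
fd = np.polyfit(edu2 - edu1, sal2 - sal1, 1)[0]

print(f"naive slope:            {naive:.2f}")
print(f"first-difference slope: {fd:.2f}   (true beta = {beta})")
```

The naive slope lands well above the true coefficient, while the first-difference slope recovers it, because the ability term is identical in both periods and cancels in the subtraction.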
Look-alike matching: Another approach is to find, for each treated individual, a similar untreated individual (a look-alike) and compare their responses to the treatment; this is the most common method currently implemented in the industry. The look-alikes can be found using a nearest-neighbour algorithm, a k-d tree, or any other similarity-search algorithm. In the smoking example, take two look-alike individuals: one of them starts smoking and the other does not. Now their stress levels can be compared over a period of time, given that no other condition changes between them. Matching in depth is actually a topic for a future article.
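The look-alike comparison can be sketched on synthetic data (everything below, including the single covariate and the effect sizes, is an assumption for illustration; real matching would use many covariates, often via propensity scores):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Observed covariate (baseline health) and treatment assignment:
# healthier people are LESS likely to smoke, so raw group means are confounded.
health = rng.normal(0, 1, n)
smokes = rng.random(n) < 1 / (1 + np.exp(2 * health))

# True model: stress rises by 1.5 with smoking, falls with baseline health.
stress = 1.5 * smokes - 2.0 * health + rng.normal(0, 0.5, n)

treated, control = health[smokes], health[~smokes]
y_t, y_c = stress[smokes], stress[~smokes]

# Naive comparison (biased: smokers started out less healthy).
naive = y_t.mean() - y_c.mean()

# 1-nearest-neighbour matching on the covariate (brute force for clarity;
# a k-d tree would do the same search faster).
idx = np.abs(treated[:, None] - control[None, :]).argmin(axis=1)
matched = (y_t - y_c[idx]).mean()

print(f"naive difference:   {naive:.2f}")
print(f"matched difference: {matched:.2f}   (true effect = 1.5)")
```

Comparing each smoker only to their nearest non-smoking look-alike removes most of the confounding that inflates the naive difference.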
Instrumental Variable (IV): This is probably the hardest technique that I find to implement. Following are the steps:

1. Find the cause–effect pair.
2. Find an attribute which is related to the cause but is independent of the error we get by regressing the cause–effect pair. This variable is known as an Instrumental Variable (IV).
3. Estimate the cause variable using the IV.
4. Regress the estimated cause against the effect to find the actual coefficient of causality.

What have we done here?
Using this methodology, we come out with an unbiased estimate, because the estimated cause carries only the variation that comes through the instrument and is therefore unrelated to the error term. In the smoking example, if we can find a variable which is connected to cigarette consumption but not to mental stress, we might be able to find the actual causal relationship.
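These steps amount to two-stage least squares (2SLS); here is a sketch on synthetic data, where the instrument, the coefficients, and the sample size are all my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

instrument = rng.normal(0, 1, n)    # related to the cause, not to the error
confound   = rng.normal(0, 1, n)    # unobserved factor

# The cause is driven by both the instrument and the confounder.
cause  = 1.0 * instrument + 1.0 * confound + rng.normal(0, 1, n)
# The effect: true coefficient 0.8, plus the same confounder (-> biased OLS).
effect = 0.8 * cause + 1.5 * confound + rng.normal(0, 1, n)

# Plain OLS slope (biased because the cause correlates with the confounder).
ols = np.polyfit(cause, effect, 1)[0]

# Stage 1: regress the cause on the instrument; keep the fitted values.
cause_hat = np.polyval(np.polyfit(instrument, cause, 1), instrument)
# Stage 2: regress the effect on the *estimated* cause.
iv = np.polyfit(cause_hat, effect, 1)[0]

print(f"OLS slope: {ols:.2f}   IV slope: {iv:.2f}   (true = 0.8)")
```

The fitted values from stage 1 contain only instrument-driven variation, so the stage-2 slope is free of the confounder's bias.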
Generally, IVs are regulation-based variables.

Regression Discontinuity: This is amongst my favourite choices, as it makes the observational data really close to an experimental design. Suppose we want to test the effect of a college scholarship on students' grades by the end of the course. Scholarships are awarded to students who already score highly, and because these students are already bright, they might continue being on top in the future as well. Hence, this is a very difficult cause–effect relation to crack!
The trick is to compare students who scored just below and just above the scholarship cutoff. The assumption is that these students are essentially similar, and the only thing which can change between them is the effect of the scholarship. This is known as Quasi-Randomized Selection. Hence, the results are very close to perfect conclusions on causality.
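A minimal sketch of this quasi-randomized comparison on synthetic data (the cutoff value, score scale, and effect size are all my assumptions): students inside a narrow band around the scholarship cutoff are compared with each other rather than with the whole population:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Entrance score determines the scholarship: awarded at or above the cutoff.
entrance = rng.uniform(0, 100, n)
cutoff = 70.0
scholarship = entrance >= cutoff

# Final grade rises with underlying ability (proxied by the entrance score)
# plus a true scholarship effect of 5 points.
final = 0.2 * entrance + 5.0 * scholarship + rng.normal(0, 3, n)

# Naive comparison: mixes the scholarship effect with the ability gap.
naive = final[scholarship].mean() - final[~scholarship].mean()

# Quasi-randomized comparison: only students within a narrow band around
# the cutoff, who are essentially similar except for the scholarship.
band = np.abs(entrance - cutoff) < 2.0
local = final[band & scholarship].mean() - final[band & ~scholarship].mean()

print(f"naive difference:       {naive:.1f}")
print(f"near-cutoff difference: {local:.1f}   (true = 5.0)")
```

A full analysis would fit local regressions on each side of the cutoff rather than take simple window means, since the narrow band still leaves a small residual bias from the score's own slope; but even this crude version lands close to the true effect while the naive comparison is far off.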