See http://www.r-bloggers.com/simpsons-paradox-2/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

http://vudlab.com/simpsons/

“Statistics are used much like a drunk uses a lamppost: for support, not illumination.”

Simpson’s Paradox (Yule-Simpson Effect). Many people are not familiar with this phenomena, so I thought I would provide a brief example to help shed some light on it. If you are not aware of the problem of confounding or “lurking” variables in your analysis, you can box yourself into an analytical corner.

In short what is happening is that a trend or association that appears between two different variables reverses when a third variable is included. You will stumble into or at least need to be on the lookout for this effect of spurious correlation when you have unbalanced group sizes, like you often have using observational data. This is what happened in my recent discovery.

In the example below, let’s have a look at the UC Berkeley data, or at least the portion of it by Department, as provided in a Wikipedia article. https://en.wikipedia.org/wiki/Simpson%27s_paradox#cite_note-Bickel-11

What the data we explore contains is the number of applications and admissions by gender to six different graduate schools. Again, this is just a portion of the data, focusing in on the largest departments.

dpt <- c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")
app <- c(825,560,325,417,191,272,108,25,593,375,393,341)
adm <- c(512,353,120,138,53,16,89,17,202,131,94,24)
gen <- c("m","m","m","m","m","m","f","f","f","f","f","f")
df <- cbind.data.frame(dpt,app,adm,gen)
str(df)
## 'data.frame':    12 obs. of  4 variables:
##  $ dpt: Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6 1 2 3 4 ...
##  $ app: num  825 560 325 417 191 272 108 25 593 375 ...
##  $ adm: num  512 353 120 138 53 16 89 17 202 131 ...
##  $ gen: Factor w/ 2 levels "f","m": 2 2 2 2 2 2 1 1 1 1 ...

There are a number of ways to demonstrate the effect, but I thought I would give it a go using dplyr. First, let’s group the data by gender (gen) then look at their overall admission rate.

by_gen = group_by(df, gen) 
summarize(by_gen, adm_rate=sum(adm)/sum(app))
## # A tibble: 2 x 2
##      gen  adm_rate
##   <fctr>     <dbl>
## 1      f 0.3035422
## 2      m 0.4602317

Clearly there is a huge gender bias problem in graduate school admissions at UC Berkeley and it is time to round up a herd of lawyers. On a side note, what is the best way to save a drowning lawyer? It’s easy, just take your foot off their head.

We can even provide a statistical test to strengthen the assertion of bias. In R, a proportions test can be done via the prop.test() function, inputting the number of “hits” and the number of “trials”.

summarize(by_gen, sum(adm))
## # A tibble: 2 x 2
##      gen `sum(adm)`
##   <fctr>      <dbl>
## 1      f        557
## 2      m       1192
summarize(by_gen, sum(app))
## # A tibble: 2 x 2
##      gen `sum(app)`
##   <fctr>      <dbl>
## 1      f       1835
## 2      m       2590
prop.test(x=c(557,1192), n=c(1835,2590))
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(557, 1192) out of c(1835, 2590)
## X-squared = 109.67, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.1856333 -0.1277456
## sample estimates:
##    prop 1    prop 2 
## 0.3035422 0.4602317

We see in the output a high level of statistical significance and a confidence interval with a difference of roughly 13 to 19 percent. However, beware of the analyst proclaiming truths using a 2×2 table with observational data! In this instance, the confounding variable is the department (dpt).

In order to have a look at the data by gender and by department, we will once again use the group_by() function, then calculate the admission rates and finally turn it from “long” to “wide” format.

by_dpt = group_by(df, dpt, gen)
df2 = as.data.frame(summarize(by_dpt, adm_rate=sum(adm)/sum(app)))
df2
##    dpt gen   adm_rate
## 1    A   f 0.82407407
## 2    A   m 0.62060606
## 3    B   f 0.68000000
## 4    B   m 0.63035714
## 5    C   f 0.34064081
## 6    C   m 0.36923077
## 7    D   f 0.34933333
## 8    D   m 0.33093525
## 9    E   f 0.23918575
## 10   E   m 0.27748691
## 11   F   f 0.07038123
## 12   F   m 0.05882353
df2_wide = dcast(df2, dpt ~ gen, value.var = "adm_rate")
df2_wide
##   dpt          f          m
## 1   A 0.82407407 0.62060606
## 2   B 0.68000000 0.63035714
## 3   C 0.34064081 0.36923077
## 4   D 0.34933333 0.33093525
## 5   E 0.23918575 0.27748691
## 6   F 0.07038123 0.05882353

With the inclusion of department we now see that females actually have higher admission rates in four of the six departments (A,B,D,F). How can this be? It had to do with rates and the volume of admission to the various departments. Again, the groups were highly unbalanced.