Thursday, August 18, 2011

How to lie with statistics

I very much enjoy giving a short course on 'How to lie with statistics' (check out my slides). In my experience, most people feel very uncomfortable interpreting statistics and applying statistical methods to their data. Part of the problem is that many young scientists seem to think they are the only ones who do not understand statistics. It seems rather that humans in general are not very good at thinking in probabilities. Gigerenzer showed that the following question is very hard to answer correctly, even for medical doctors:

If a normal student donates some blood and an HIV test turns out positive, how likely is it that she is really infected? To answer that, you need to know the sensitivity and specificity of the test, i.e., the probability that an infected person will be detected by the test and the probability that a non-infected person will have a negative test result. Both probabilities are very high, around 99.8%. Still, the probability that a student with a positive first test is really infected is only 40/2040 = 1/51, or about 2%. Astonished?

80% of the medical doctors he interviewed gave the wrong answer, another 7% got close, but presumably through some obvious error, and only the rest gave the correct answer.
But Gigerenzer also showed very nicely how to remedy the problem: forget about probabilities and think in natural frequencies! Then you'll notice that you need another piece of information: in the normal population, how many people are infected and do not know it? According to the German Robert Koch Institute, there are about 3,000 new HIV infections per year in the whole population. So, if 1,000,000 people (out of 80 million Germans) donate blood, we can expect roughly 40 of them to be infected with HIV (3,000/80 is about 40). The other nearly 1,000,000 donors are not infected, but 2 out of 1,000 of them will test positive nonetheless. Thus, in total there will be 2,000 + 40 positive tests, of which only 40 really point to infected donors.
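
The natural-frequency argument above can be written down as a few lines of Python. This is only a minimal sketch: the function name `ppv_natural_frequencies` is my own, and the inputs are the rounded numbers from the text (1,000,000 donors, 40 of them infected, sensitivity and specificity of 99.8%).

```python
from fractions import Fraction


def ppv_natural_frequencies(donors, infected, sensitivity, specificity):
    """Share of positive tests that belong to truly infected donors."""
    healthy = donors - infected
    true_positives = infected * sensitivity        # infected donors the test detects
    false_positives = healthy * (1 - specificity)  # healthy donors who test positive anyway
    return true_positives / (true_positives + false_positives)


# Back-of-the-envelope version from the text: 40 true and 2,000 false positives.
rounded = Fraction(40, 2000 + 40)

# Same calculation without rounding the intermediate counts.
exact = ppv_natural_frequencies(
    donors=1_000_000,
    infected=40,
    sensitivity=Fraction(998, 1000),
    specificity=Fraction(998, 1000),
)

print(rounded)       # 1/51
print(float(exact))  # slightly below 1/51, because not all 40 infected donors are detected
```

Using `Fraction` keeps the counts exact, which makes it easy to see that the rounding in the text barely changes the result.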

In network analysis, too, there have been several articles on statistics and the problems with it. Most of them warned about wrong random models, or at least warned that we need to think hard about the right random graph model:
1. Artzy-Randrup et al. showed that network motifs might be evaluated differently depending on the underlying random graph model.
2. Colizza et al. showed that the rich-club effect also needs to be assessed against an appropriate random network model. Considering only the nodes with degree at least k, the fraction of realized edges among these most connected nodes is called the rich-club coefficient. A rich-club coefficient that increases with k was thought to be a sign of a rich-club effect, in which the most connected nodes form a dense core. Colizza et al. showed that even random graphs have an increasing rich-club coefficient, since high-degree nodes simply have a higher probability of being connected to any node, including other high-degree nodes. Thus, it is important to compare against a random graph model that maintains the degree sequence.
3. We have shown a similar effect in assessing the similarity of two nodes in a bipartite graph. Say you have a bipartite graph between customers and films, with an edge between a customer and a film if the customer stated she liked that film. If two films are co-liked by 23 persons, is this statistically significant? What if they are co-liked by 1179 persons? By choosing the right random network model, it is easy to quantify the significance of this number. (The second pair of films was 'Pretty Woman' and 'Star Wars V', and you might have guessed that this number is significantly too low given their respective popularity.) See the figure for more examples.
4. Even modularity, which already checks the observed edges within partitions against the expected number of these edges in a graph, has some statistical flaws (or say, unintuitive properties) which were thankfully pointed out by Fortunato and Barthélemy.
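
To make the rich-club point from item 2 concrete, here is a minimal sketch in plain Python. It is not the procedure of Colizza et al.: the helper names, the toy graph, and the use of random double-edge swaps as the degree-preserving random model are all my own assumptions, and I follow the "degree at least k" wording used above.

```python
import random
from collections import Counter
from itertools import combinations


def degrees(edges):
    """Degree of every node in an undirected simple graph given as an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def rich_club_coefficient(edges, k):
    """Fraction of realized edges among nodes with degree at least k."""
    deg = degrees(edges)
    rich = {node for node, d in deg.items() if d >= k}
    if len(rich) < 2:
        return None  # undefined for fewer than two rich nodes
    realized = sum(1 for u, v in edges if u in rich and v in rich)
    possible = len(rich) * (len(rich) - 1) // 2
    return realized / possible


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Rewire with double-edge swaps: (a,b),(c,d) -> (a,d),(c,b); degrees stay fixed."""
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def normalized_rich_club(edges, k, n_null=20, seed=0):
    """Observed coefficient divided by its mean over degree-preserving shuffles."""
    observed = rich_club_coefficient(edges, k)
    if observed is None:
        return None
    rng = random.Random(seed)
    null = [
        rich_club_coefficient(degree_preserving_shuffle(edges, 10 * len(edges), rng), k)
        for _ in range(n_null)
    ]
    mean_null = sum(null) / len(null)
    return observed / mean_null if mean_null > 0 else None


# Toy graph (hypothetical data): 30 nodes, each pair connected with probability 0.2.
toy_rng = random.Random(42)
toy_edges = [e for e in combinations(range(30), 2) if toy_rng.random() < 0.2]

for k in (4, 6, 8):
    print(k, rich_club_coefficient(toy_edges, k), normalized_rich_club(toy_edges, k))
```

The raw coefficient alone says little; only the ratio to the degree-preserving shuffles indicates whether the rich nodes are more densely connected than their degrees already imply.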

Pairwise "similarity" of the most popular films in the Netflix data set (popular = rated by at least 30% of all customers). We have sorted the films such that they fall into natural groups, like action films in group 1 and chick flicks in group 5. For each pair we counted the number of customers that liked both films (rating 4 or 5 out of 5). Blue fields indicate that the pair of films has been co-liked more often than expected, and red fields denote pairs of films that were significantly less co-liked than expected. Of course, the expectation depends on the chosen random model. The classic random model (lower right) thinks that all popular films are liked together more often than expected. This includes the film pair "50 first dates" and "The Patriot". If you also think that both films are simply wonderful, you can stop here. By using a different random model inspired by network analysis, a much more differentiated picture emerges (upper left). Here, it can be clearly seen that films from the same group are still co-liked more often than expected, while films from different groups (e.g., 1 and 5) are co-liked distinctly less often than expected.
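
How much the expectation depends on the random model can be tried out on toy data. The sketch below rests on several assumptions of mine and does not reproduce the figure: the data are synthetic, the "classic" model is read as simple independence (expected co-likes = likes of A times likes of B divided by the number of customers), and the network-inspired model is read as a bipartite graph in which every customer and every film keeps its number of likes, sampled by random edge swaps. All function names are hypothetical.

```python
import random

FILMS = ["action_1", "action_2", "romance_1", "romance_2"]


def make_toy_likes(n_customers=200, seed=7):
    """Synthetic likes: customers differ in favourite genre and in general enthusiasm."""
    rng = random.Random(seed)
    likes = {}
    for c in range(n_customers):
        favourite = "action" if c % 2 == 0 else "romance"
        enthusiasm = rng.choice([0.3, 0.9])
        likes[c] = {
            f for f in FILMS
            if rng.random() < enthusiasm * (1.0 if f.startswith(favourite) else 0.4)
        }
    return likes


def co_likes(likes, film_a, film_b):
    """Number of customers who like both films."""
    return sum(1 for films in likes.values() if film_a in films and film_b in films)


def independence_expectation(likes, film_a, film_b):
    """Expected co-likes if every customer liked each film independently."""
    n_a = sum(film_a in films for films in likes.values())
    n_b = sum(film_b in films for films in likes.values())
    return n_a * n_b / len(likes)


def swap_shuffle(likes, n_swaps, rng):
    """Rewire customer-film edges; every customer and film keeps its number of likes."""
    edges = [(c, f) for c, films in likes.items() for f in sorted(films)]
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (c1, f1), (c2, f2) = edges[i], edges[j]
        if (c1, f2) in present or (c2, f1) in present:
            continue  # swap would duplicate an edge (also covers c1 == c2 or f1 == f2)
        present -= {(c1, f1), (c2, f2)}
        present |= {(c1, f2), (c2, f1)}
        edges[i], edges[j] = (c1, f2), (c2, f1)
    shuffled = {c: set() for c in likes}
    for c, f in edges:
        shuffled[c].add(f)
    return shuffled


def null_co_likes(likes, film_a, film_b, n_null=30, seed=0):
    """Co-like counts of the pair in n_null degree-preserving shuffles."""
    rng = random.Random(seed)
    n_edges = sum(len(films) for films in likes.values())
    return [
        co_likes(swap_shuffle(likes, 10 * n_edges, rng), film_a, film_b)
        for _ in range(n_null)
    ]


toy_likes = make_toy_likes()
for pair in [("action_1", "action_2"), ("action_1", "romance_1")]:
    observed = co_likes(toy_likes, *pair)
    naive = independence_expectation(toy_likes, *pair)
    null = null_co_likes(toy_likes, *pair)
    print(pair, observed, round(naive, 1), round(sum(null) / len(null), 1))
```

Each printed line shows the observed number of co-likes next to the two expectations, so you can check for yourself how the verdict "more than expected" or "less than expected" shifts with the model.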

I would like to know whether there is a trick, similar to thinking in frequencies rather than in probabilities, that helps to get the choice of the appropriate random graph model right. If you know one, let me know!

Feel free to use my slides for your own lecture on 'How to lie with statistics' (please give credit, and note that all the images are under some GPL, and so should your modifications be). I also like to travel, so if I'm around and you're interested in this short course, I would love to come to your institution and give it. It takes three hours and is targeted at first-year PhD students from non-mathematical fields.