Saturday, November 12, 2016

Note 14: Network analysis and statistics

Network Analysis has made heavy use of statistics - and it seems that statistics is not humankind's strength. That does not make network analysis any easier.
This figure is under CC:BY with a reference to Prof. Dr. Katharina A. Zweig or to this blogpost.


A main part of network analysis is computing and interpreting statistical numbers. Unfortunately, statistics is not the most intuitive part of mathematics, and it is well known that even trained scientists have problems correctly interpreting statistical results. Consider the following test:

A group of young students without any symptoms of sickness make a blood donation. Routinely, their blood is checked for an HIV infection. The test is very sensitive. For simplicity, assume that it detects 99.99% of all infected persons and that non-infected persons will get a negative test with a probability of 99.99%. If a person's first test now returns a positive result, what is the probability that she is actually infected?

If you think it is 99.99%, you are in very good company (but wrong):

Note 14. Gigerenzer and his team showed in various
studies that almost none of the experts was able to give
the correct answer. Most answered that, as the test is
so specific and so sensitive, the probability that a person
is infected if the test says so is 99.99%.
(Zweig, 2016)

Actually, the question cannot really be answered without knowing the chance that a person without any symptoms is infected. This probability can be approximated by the so-called incidence rate, the number of new infections per year. For Germany, this is around 3,000 new infections per year in a nation with about 80 million inhabitants (for young people, it might actually be higher than for the general population, but as an approximation that is fine).

We now want to know the probability that a person is infected if her test turns out to be positive. There are two ways to get a positive test: the person is infected and detected, or the person is not infected but falsely flagged. If we tested all of Germany, we would in essence find all of the 3,000 newly infected persons. However, of the remaining (still roughly) 80 million people, we would falsely flag 0.01%, i.e., 1 person in 10,000. Thus, we additionally flag 8,000 people as positive. Of all 11,000 people with a positive test, 8,000 would actually not be infected. That is, the probability that a person with a positive test is infected is 3,000/11,000, which is less than 50%, namely around 27%. Surprised?
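
The arithmetic above can be replayed in a few lines of Python. This is only a sketch under the simplifying assumptions of this post (80 million people, 3,000 infected, 99.99% sensitivity and specificity); the function name and its parameters are made up for illustration.

```python
def positive_predictive_value(population, infected, sensitivity, specificity):
    """Return (true positives, false positives, P(infected | positive test))."""
    # Infected persons that the test correctly detects.
    true_positives = infected * sensitivity
    # Non-infected persons that the test falsely flags.
    false_positives = (population - infected) * (1.0 - specificity)
    # Share of all positive tests that belong to infected persons.
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


# Numbers taken from the example in the text (rough approximations).
tp, fp, ppv = positive_predictive_value(
    population=80_000_000,
    infected=3_000,
    sensitivity=0.9999,
    specificity=0.9999,
)
print(f"correctly flagged: {tp:,.0f}")
print(f"falsely flagged:   {fp:,.0f}")
print(f"P(infected | positive test) = {ppv:.1%}")
```

Running it gives roughly 3,000 correctly flagged and 8,000 falsely flagged persons, so only about 27% of the positive tests belong to infected persons.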

This computation has a very important consequence. Let our 'null-hypothesis' be that any given person is not infected. Now, we know that the probability that a person is not infected and still gets a positive test is very small (here 0.0001). This value is called the p-value: the probability to observe the data given the assumption in the null-hypothesis. In particular, it is smaller than p=0.05, the classic threshold value to 'reject the null-hypothesis'. However, as we have seen, what we actually need is the probability that the person is infected given a positive test result. And this probability can differ strongly from the other one when the ratio of the two classes (infected vs. not infected) is not around 0.5. Thus, rejecting a null-hypothesis just because the observation of the data ("positive test") is unlikely given the assumption ("not infected") is the wrong approach.
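
To see how strongly the two probabilities can differ, here is a small sketch that applies Bayes' rule to the same test for different shares of infected persons. The function name `posterior_infected` and the chosen example priors (other than the 3,000 in 80 million from this post) are assumptions for illustration only.

```python
def posterior_infected(prior, sensitivity=0.9999, specificity=0.9999):
    """P(infected | positive test) via Bayes' rule for a given prior P(infected)."""
    p_pos_given_infected = sensitivity
    # P(positive test | not infected): the value called the p-value in the text.
    p_pos_given_healthy = 1.0 - specificity
    # Total probability of a positive test over both classes.
    p_pos = prior * p_pos_given_infected + (1.0 - prior) * p_pos_given_healthy
    return prior * p_pos_given_infected / p_pos


# The p-value is the same in every row; only the class ratio changes.
for prior in (0.5, 0.01, 3_000 / 80_000_000):
    print(f"P(infected) = {prior:.6f} -> "
          f"P(infected | positive) = {posterior_infected(prior):.4f}")
```

Only when both classes are equally likely (prior 0.5) does the answer match the intuitive 99.99%; with the incidence rate assumed above it drops to around 27%, although P(positive test | not infected) never changes.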

Note 15. The only correct verbal descriptions of a p-value need to
contain the words given that the null-hypothesis is true as
the p-value conditions on that. As the p-value does not
say anything about the probability of the hypothesis to
be true, given the observed data, it cannot be used as a
basis for rejecting the null-hypothesis.
(Zweig, 2016)

A small p-value is just the first step toward updating our probability of the assumption, given the observed data. This will be important, e.g., to identify network motifs.

If statistics is already hard, then this makes network analysis no easier! Read more about statistical hypothesis testing on Wikipedia. Or join my Mendeley group on "Good statistics papers for non-statisticians".



References:

(Gigerenzer, 2007) Gerd Gigerenzer, Wolfgang Gaissmaier, Elke Kurz-Milcke, Lisa M. Schwartz, and Steven Woloshin: Helping doctors and patients make sense of health statistics. Psychological Science in the Public Interest, 8(2), 2007.

(Zweig, 2016) Katharina A. Zweig: Network Analysis Literacy, ISBN 978-3-7091-0740-9, Springer Vienna, 2016.
