Saturday, November 12, 2016

Note 16: On distances, edge weights, and other modeling decisions

(Graph-theoretic) distance is certainly one of the best understood concepts in network analysis. However, not every weighted graph should be used to compute the distance between all pairs of nodes.
This figure is under CC:BY with a reference to Prof. Dr. Katharina A. Zweig or to this blogpost.
The distance between any two nodes in an undirected, unweighted graph is defined as the minimal number of edges one has to traverse to get from one node to the other. A natural generalization of this concept to weighted graphs defines the distance as the minimal sum of the weights over any sequence of edges between the two nodes.
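As a sketch, the weighted definition can be computed with Dijkstra's algorithm; the small graph and its weights below are invented purely for illustration:

```python
import heapq

def dijkstra(graph, source):
    """Distance = minimal sum of edge weights from source to each node.

    graph: dict mapping node -> list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# A toy undirected graph; each edge is listed in both directions.
graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 2)],
    "c": [("a", 4), ("b", 2)],
}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

Note that the direct edge a-c (weight 4) is not the shortest path: going via b sums to 3.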

For weights that represent the lengths of streets, this makes perfect sense. But of course, weights in complex networks can represent almost anything. Consider the number of hours two people called each other in the last two weeks, or the probability of surfing from one webpage to another via a direct link from the first to the second page.

The distance between two nodes in a complex network is used for many things, foremost for so-called centrality indices such as the betweenness centrality or the closeness centrality. In general, for most centrality indices, a low distance to many other nodes makes a node more central. However, the length of calls between two persons is rather a measure of their closeness, not their distance. Thus, summing up these values will actually favor those pairs of nodes that are linked by paths through people who call each other only briefly. It can help here to invert the weights to make the meaning of distance more intuitive. However, even in this case: what does it actually mean if I am connected to another person through two people who talk to each other for two hours each? Then my distance to that person is 1/2 + 1/2 = 1. Does that make me closer to them than being directly connected to another person whom I call for only 10 minutes, i.e., with a "distance" of 6?
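The arithmetic of this example can be spelled out directly (the call durations are the ones from the text):

```python
# Call durations in hours: high values mean closeness, not distance.
hours_via_two_friends = [2.0, 2.0]  # two intermediate links, 2 h of calls each
hours_direct = 10 / 60              # one direct 10-minute call

# Inverting each weight turns closeness into a "distance".
dist_via_friends = sum(1 / h for h in hours_via_two_friends)  # 1/2 + 1/2
dist_direct = 1 / hours_direct                                # 60/10

# The indirect path now looks "closer" than the direct call --
# exactly the interpretability problem discussed above.
print(round(dist_via_friends, 1), round(dist_direct, 1))  # 1.0 6.0
```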

With probabilities, a summation obviously does not make much sense. Here, a multiplication of the weights along a path might be most meaningful to yield interpretable results.
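Multiplying probabilities along a path is equivalent to summing their negative logarithms, which lets standard shortest-path routines find the most probable path. A minimal sketch; the link probabilities here are invented:

```python
import math

# Click-through probabilities along two alternative paths of links.
path_a = [0.9, 0.9]  # two fairly likely clicks
path_b = [0.5]       # one direct but unlikely click

def path_probability(probs):
    """Probability of following the whole path: product of edge weights."""
    prob = 1.0
    for p in probs:
        prob *= p
    return prob

# Summing -log(p) ranks paths exactly like multiplying probabilities:
# the path with the smallest log-cost has the largest product.
cost_a = sum(-math.log(p) for p in path_a)
cost_b = sum(-math.log(p) for p in path_b)
print(round(path_probability(path_a), 2), path_probability(path_b))  # 0.81 0.5
```

So the two-step path is more probable than the direct link, and a weighted shortest-path algorithm run on the -log-transformed weights would find it.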

Note 16. If network representation and network analytic measure are well matched, the measure's value can be interpreted with respect to the functionality of the network for the complex system of interest.
(Zweig, 2016)

Read more on this under the general keyword "trilemma of complex network analysis" and in Chapter 5, "Network representations of complex systems", of my book.




Reference:

(Zweig, 2016) Katharina A. Zweig: Network Analysis Literacy, Springer Vienna, 2016, ISBN 978-3-7091-0740-9

Note 14: Network analysis and statistics

Network Analysis has made heavy use of statistics - and it seems that statistics is not humankind's strength. That does not make network analysis any easier.


A main part of network analysis is computing and interpreting statistical numbers. Unfortunately, statistics is not the most intuitive part of mathematics, and it is well known that even trained scientists have problems correctly interpreting statistical results. Consider the following test:

A group of young students without any symptoms of sickness donate blood. Routinely, their blood is checked for an HIV infection. The test is very sensitive. For simplicity, assume that it detects 99.99% of all infected persons and that non-infected persons get a negative result with probability 99.99%. If now a person's first test returns a positive result, what is the likelihood that she is actually infected?

If you think it is 99.99%, you are in very good company (but wrong):

Note 14. Gigerenzer and his team showed in various studies that almost none of the experts was able to give the correct answer. Most answered that, as the test is so specific and so sensitive, the probability that a person is infected if the test says so is 99.99%.
(Zweig, 2016)

Actually, the question cannot be answered without knowing the chance that a person without any symptoms is infected in the first place. This probability can be approximated by the so-called incidence rate, the number of new infections per year. For Germany, this is around 3,000 in a nation of about 80 million inhabitants (for young people it might actually be higher than for the general population, but as an approximation that is fine).

We now want to know the probability that a person is infected if her test turns out to be positive. There are two ways to get a positive test: the person is infected and detected, or the person is not infected but falsely flagged. If we tested the whole of Germany, we would in essence find all of the 3,000 newly infected persons. However, from the remaining (still roughly) 80 million people, we would falsely flag 0.01%, i.e., 1 in 10,000. Thus, we additionally flag 8,000 people as positive. Of all 11,000 people with a positive test, 8,000 would actually not be infected. I.e., the probability that a person with a positive test is infected is less than 50%, namely around 27%. Surprised?
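The computation can be reproduced directly; the numbers are those from the text:

```python
population = 80_000_000  # approximate German population
newly_infected = 3_000   # approximate yearly incidence
sensitivity = 0.9999     # P(positive | infected)
specificity = 0.9999     # P(negative | not infected)

true_positives = newly_infected * sensitivity                        # ~3,000
false_positives = (population - newly_infected) * (1 - specificity)  # ~8,000

# P(infected | positive): the quantity the question actually asks for.
p_infected_given_positive = true_positives / (true_positives + false_positives)

# Contrast with the p-value-like quantity P(positive | not infected).
p_positive_given_not_infected = 1 - specificity

print(round(p_infected_given_positive, 2))       # 0.27
print(round(p_positive_given_not_infected, 4))   # 0.0001
```

The two conditional probabilities differ by more than three orders of magnitude, which is the whole point of the example: conditioning direction matters.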

This computation has a very important consequence. Let our null hypothesis be that any given person is not infected. Now, we know that the probability that a person is not infected and gets a positive test is very small - this value is called the p-value (the probability of observing the data given the assumption in the null hypothesis). In particular, it is smaller than p = 0.05, the classic threshold for 'rejecting the null hypothesis'. However, as we have seen, we need to compute the probability that the person is infected given a positive test result. And this probability can differ strongly from the other one when the ratio of the two classes (infected vs. not infected) is not around 0.5. Thus, rejecting a null hypothesis just because, given the assumption ("not infected"), the observation of the data ("positive test") is unlikely, is the wrong way.

Note 15. The only correct verbal descriptions of a p-value need to contain the words "given that the null-hypothesis is true", as the p-value conditions on that. As the p-value does not say anything about the probability of the hypothesis to be true, given the observed data, it cannot be used as a basis for rejecting the null-hypothesis.
(Zweig, 2016)

Updating our probability of the assumption, given the observed data, is just the first step. This will be important, e.g., for identifying network motifs.

If statistics is already hard, this makes network analysis in no way easier! Read more about statistical hypothesis testing on Wikipedia. Or join my Mendeley group on "Good statistics papers for non-statisticians".



 References:

(Gigerenzer, 2007) Gerd Gigerenzer, Wolfgang Gaissmaier, Elke Kurz-Milcke, Lisa M. Schwartz, and Steven Woloshin. Helping doctors and patients make sense of health statistics. Psychological science in the public interest, 8(2), 2007.

(Zweig, 2016) Katharina A. Zweig: Network Analysis Literacy, Springer Vienna, 2016, ISBN 978-3-7091-0740-9

Notes 12 and 13: Evolution of complex networks

Networks evolve under different constraints and forces, in which network science is especially interested.




The perspective of statistical physics is always to find universal forces that form and determine a given phenomenon. From this perspective, network science, originating in statistical physics, has always been searching for the forces on either the entities or the whole system that govern the evolution of a network's structure.

"Note 12. Network science describes the topology of a network as the effects of forces on either the entities or the whole system." (Zweig, 2016)

For example, the "smallness" of small worlds can be explained by minimizing cost (energy) and the diameter at the same time, under the assumption that an edge between more distant nodes is more expensive. Putting some energy into long-distance edges is the optimal way to reduce the diameter of the network with the smallest total cost involved.
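This trade-off can be illustrated with a toy experiment: a ring lattice has a large diameter, and a few long-range "expensive" edges already shrink it considerably. A plain-Python sketch with an arbitrarily chosen ring size and shortcut placement:

```python
from collections import deque

def diameter(n, extra_edges=()):
    """Diameter of a ring of n nodes plus optional shortcut edges."""
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for u, v in extra_edges:
        adj[u].add(v)
        adj[v].add(u)
    best = 0
    for s in range(n):  # BFS from every node, track the largest distance
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

print(diameter(20))                          # 10: the plain ring
print(diameter(20, [(0, 10), (5, 15)]))      # smaller: two shortcuts shrink it
```

Only two extra edges, i.e., a small additional cost, already cut the diameter substantially - the intuition behind the energy/diameter trade-off above.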

Looking at a dynamic system, statistical physics is then most interested in two phenomena: phase transitions and equilibria. Phase transitions mark the point where a small shift in one parameter radically changes the way the system behaves. As such, this is not a phase in which it is easy to understand the forces and constraints under which the networks are built. It is rather a state of 'criticality', of which scale-free distributions are a tell-tale sign. When the "scale-free" degree distribution was observed (which in most cases is actually not directly scale-free), it seemed as if complex networks were in a state of 'self-organized criticality', i.e., a system that keeps itself in this state. This would have been very interesting because it points to a certain equilibrium of forces. In any case, the other most interesting state is the equilibrium, which can be either stable or unstable. In the first case, small perturbations are quickly dampened and the system returns to the equilibrium state. In an unstable one, a small perturbation leads to a totally different state. In equilibria it is often easier to understand the forces and constraints that form the system's behavior.

"Note 13. Network science is interested in equilibrium
structures as they can be used to understand the forces,
constraints, or incentives under which a network is built.
" (Zweig, 2016)

Presumed constraints are, for example, Dunbar's number in social networks, i.e., the observation that people might not be able to maintain more than about 150 close acquaintances. Reasonable forces are energy/cost minimization for most networks, i.e., the number and length of edges to be built is limited or should be minimized while - in most cases - still building connected and maybe even locally dense networks.

You can find more on the perspective of statistical physics in the 2nd chapter, "Universal structures vs. individual features", of my book. There I compare it with the perspective of social science and explain why I believe that Network Science is really something different from (Social) Network Analysis.


Reference:

(Zweig, 2016) Katharina A. Zweig: Network Analysis Literacy, Springer Vienna, 2016, ISBN 978-3-7091-0740-9

All figures are under CC:BY with a reference to Prof. Dr. Katharina A. Zweig or to this blogpost, if not mentioned otherwise.