Friday, May 1, 2015

Impressum

This blog is written by

Prof. Dr. Katharina A. Zweig

Gottlieb-Daimler-Str. 48
Department of Computer Science
67663 Kaiserslautern

Telephone number: +49 631 205 3346
Email address: lastnameOfBlogger AT cs DOT uni-kl.de


The opinions expressed in my blog are my own and not those of the university at which I work. All scientific results reported here are carefully researched; however, there is always the possibility of mistakes and misunderstandings on my side. Please make up your own mind (as always).

Saturday, January 4, 2014

First chapter of my upcoming book is online

As some of you know, I am working on my book Network Analysis Literacy to get it finished as soon as possible. Holiday was a good time to get the first chapter polished. Have a nice read and come back to me with any comments you may have.

Thursday, January 2, 2014

Data sets for network analysis

Sunday, February 5, 2012

A bibtex-batch script

I am currently writing my book in Texlipse, a plug-in for Eclipse. The book is organized in several \input files, with a separate bibliography file for each chapter. I could not find any help within Texlipse on how to force a BibTeX run on each single input file, so the following batch script can help. Open a text editor like WordPad or Notepad++ and copy the following line:

for /f %%a IN ('dir /b *.aux') do bibtex %%a

Save the file, e.g., as bibtexBatch.BAT. For convenience, save it to the same directory in which your .aux files are located.
Then run 'cmd' to get a DOS shell, change to the directory in which your .aux files are located, and run the batch file by calling it directly: bibtexBatch.BAT

It lists all files in the current directory with the .aux extension and hands each of them over to BibTeX.
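For those not on Windows, the same loop can be sketched in Python (a cross-platform alternative, assuming the bibtex binary is on the PATH):

```python
# Cross-platform sketch of the same loop: run bibtex on every .aux file
# in the current directory (assumes the bibtex binary is on the PATH).
import glob
import shutil
import subprocess

aux_files = sorted(glob.glob("*.aux"))
if shutil.which("bibtex") is None:
    print("bibtex not found on PATH")
else:
    for aux in aux_files:
        print("Running bibtex on", aux)
        subprocess.run(["bibtex", aux], check=False)
```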

Monday, November 7, 2011

Quotes from network analysis papers

I am currently doing heavy literature research in network analysis, and from time to time I find quotes that deserve a wider audience.

Reciprocity

"We use the term reciprocity to depict the degree of mutuality of a relationship. A relationship with a high reciprocity is one where both are equally interested in keeping up the relationship — a good example is the relationship of Romeo and Juliet in the famous play with the same name by William Shakespeare, where the two lovers strive to share each other’s company despite their families’ objections. On the other hand, in a relationship with a low reciprocity one person is significantly more active in maintaining the relationship than the other. We judge reciprocity from actual communications taking place between people. In Shakespeare’s time this would have meant counting the number of poems passed and sonnets sung, but in our modern era it is easier to make use of the prevalence of electronic communication." :-) [Lauri Kovanen, Jari Saramäki, Kimmo Kaski: "Reciprocity of mobile phone calls", arXiv:1002.0763v1 [physics.soc-ph], p. 1-2]

Centrality Indices

Sabidussi about the introduction of new centrality indices:
"There is no doubt that an appeal to intuition must be made at some level, but it has to be made in a mathematically precise fashion." (Sabidussi, The centrality index of a graph, Psychometrika 31(4), 581-603, 1966)
So, how good is your mathematical precision on your intuition? ;-)

Sabidussi tried an axiomatic approach to centrality indices. He required, for example, that adding an edge to the most central vertex of a graph should always result in a graph in which the same vertex is still the most central. Although this sounds intuitive, most centrality indices do not pass this test (e.g., eccentricity). Nonetheless, Sabidussi concludes:

"One may, of course, blame this wholesale failure on our axiom system. There is little doubt, however, that the three indices [which he tested and which failed as well, among them a closeness-type of centrality] [...] would not survive even a more sophisticated system of axioms. In view of this, we strongly suggest that [these measures] be discarded and that centrality be measured by the trivial index [1/(n - degree of the node)] defined [above]. [It] is more easily calculated than any of the other indices, and, whatever its intuitive shortcomings, it has the decided advantage of satisfying a well-defined system of axioms." Sabidussi, 1966 (p. 20, see above)

Citation Failures

As everyone knows, and as thankfully remarked by Goh, Kahng and Kim [Universal Behavior of Load Distribution in Scale-Free Networks, PRL, 87(27), 278701, 2001], Mark Newman introduced the BFS: "... and measure the load along the shortest path using the modified version of the breath-first [sic!] search algorithm introduced by Newman [M. E. J. Newman, Phys. Rev. E 64, 016131 (2001); 64, 016132 (2001)]"

Excuses and Justifications


"Edge weights in networks have, with some exceptions [...], received relatively little attention in the physics literature for the excellent reason that in any field one is well advised to look at the simple cases first (unweighted networks) before moving on to more complex ones (weighted networks)." M.E.J. Newman, "Analysis of weighted networks", arXiv:cond-mat/0407503v1

Excuse for not asking all participants of the study the same questions:
"We only asked the last two informants this question because it didn't occur to us earlier."
[Bernard, H. R.; Shelley, G. A. & Killworth, P.: "How much of a network does the GSS and RSW dredge up?", Social Networks, 1987, 9, 49-61]
In general I have the feeling that scientific articles were more personal in those days. This particular article starts with the following quote: "At my twenty-fifth high school reunion, last year, you would have been proud of me. The way I called those names up, with seldom a quick half glance at a tag ... their names came to me like the list of vowels, because I had learned them when I was fresh, back before I had met or heard of 375,000 other Americans. By the time anyone gets to be 43, if he has followed current events and been out of town a few times, two thirds of the names he hears sound vaguely, but only vaguely, familiar" (From "Not Exactly What I Had in Mind", Roy Blount Jr., The Atlantic Monthly Press, 1985)
I wish we would read more about the persons behind the research. It would remind us all that science is quantifiable but still conducted by humans who err or make subjective decisions.


The following is a quote by two physicists about their engagement in economics, but the statement seems so general that it might apply to network analysis as well:
"Physicists are not, however, accustomed to waiting for a fully formed theory before reporting new results." Tobias Preis and H. Eugene Stanley: "Bubble trouble", Physics World May, p. 29-32, 2011.


Data quality: Protein-protein interaction networks

Mackay et al. wrote a paper about protein-interaction data and how unreliable they are. They tried to reproduce around 20 published protein-protein interaction pairs but were able to confirm fewer than half of them. Their article starts with the following sentences:

"When Othello concluded, on the basis of flimsy evidence, that Desdemona had indulged in inappropriate physical interactions, great tragedy ensued. We believe that many reported protein interactions are similarly based on insufficient data, and therefore risk skewing the literature."
(Mackay, J. P.; Sunde, M.; Lowry, J. A.; Crossley, M. & Matthews, J. M. Protein interactions: is seeing believing? TRENDS in Biochemical Sciences, 2008, 32, 530-531)


I really wonder whether we want to base any type of network analysis on data where up to 50% of all edges are false positives.

Copy-pasting story lines

There are some story lines in network analysis that I have just heard way too often, and it seems they are simply copy-pasted from one paper to the next. One of them is: "Until recently, network modeling often assumed the topology was either a random graph or a regular lattice." In this case, the citation is from Goldberg and Roth, "Assessing experimentally derived interactions in a small world", PNAS 100(8), p. 4372-76, 2003, but it can actually be found, in slight variations, in hundreds of papers.

How science builds narratives that convey its beliefs in a model is a very interesting topic in itself, but I think it is important to stop from time to time and re-think a story line. Often, the narrative becomes oversimplified by little "mutations" along its copy-and-paste evolution, which essentially hinders science more than it helps.

Coming back to the example: it is unreasonable to believe that any mature scientist would actually assume that a real-world network was either a random graph or a regular lattice. We have modeled networks as random graphs or regular lattices in the hope that these models capture their essential properties - or just because a random graph model and a lattice come in handy when trying to calculate things. The origin of these two models, especially in complex network analysis, is that atomic interactions in crystal lattices could be analyzed and solved in exactly two settings: a lattice structure, which comes naturally for crystals, or a random graph fashion - which is not a natural choice, but one in which some insight into the model can be gained. Thus, Watts and Strogatz, who both come from statistical physics, where these interactions are an important research question, focused on these two models - they were prevalent in statistical physics and well understood there. It is thus reasonable to transfer these models to real-world interactions between things other than atoms, just to see how well they do.
This has nothing to do with actually assuming that the models capture the real-world interactions in any meaningful way.

Mathematical modeling (in general and for network analysis)

In his article "Sampling in social networks", Rothenberg cites Rapoport as follows: "Mathematical modeling is a vehicle for absolutely rigorous reasoning and therein lies its advantage. A disadvantage of mathematical modeling is that it necessitates a drastic paring down of the details of the phenomena modeled...these simplifications...can impair or altogether destroy the model's pragmatic relevance."


To be continued...

Thursday, August 18, 2011

How to lie with statistics

I very much like to give a short course on 'how to lie with statistics' (check out my slides). My experience is that most people feel very uncomfortable interpreting statistics and applying statistical methods to their data. Part of the problem seems to be that most young scientists think they are the only ones who do not understand statistics. But it rather seems to be the case that humans in general are not very good at thinking in probabilities. Gigerenzer showed that the following question is very hard to answer correctly, even for medical doctors:

If a normal student donates some blood and an HIV test turns out positive, how likely is it that she is really infected? To answer that you need to know the sensitivity and specificity of the test, i.e., the probability that an infected person will be detected by the test and the probability that a non-infected person will have a negative test result. Both probabilities are very high, around 99.8%. Still, the probability that a student with a positive first test is really infected is only 40/2040 = 1/51. Astonished?

80% of the medical doctors he interviewed gave the wrong numbers, another 7% got close but presumably by some obvious error, and only the remaining ones gave the correct answer.
But Gigerenzer also showed very nicely how to remedy the problem: forget about probabilities and think in natural frequencies! Then you will notice that you need one more piece of information: how many people in the normal population are infected each year without knowing it? According to the German Robert Koch Institute, this incidence rate of new HIV infections is about 3,000 per year in the whole population. So, if 1,000,000 people (out of 80 million Germans) donate blood, we can expect around 40 of them to be infected with HIV. The remaining close to 1,000,000 donors are not infected, but 2 out of 1,000 of them will test positive nonetheless. Thus, in total there will be 2,000 + 40 positive tests, of which only 40 really point to infected donors.
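The natural-frequency arithmetic above can be written down in a few lines (my own sketch using the numbers from the post, not Gigerenzer's calculation):

```python
# Natural frequencies from the example above: 1,000,000 donors, ~40 infected,
# and a 0.2% false-positive rate among the non-infected donors.
donors = 1_000_000
infected = 40
false_positive_rate = 1 - 0.998           # specificity of 99.8%

true_positives = infected                 # sensitivity ~99.8%, so nearly all 40
false_positives = round((donors - infected) * false_positive_rate)  # ~2,000

p = true_positives / (true_positives + false_positives)
print(p)  # 40/2040 = 1/51, roughly 0.0196
```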


Also in network analysis there were some articles regarding statistics and problems with it. Most of them warned about wrong random models or at least warned that we need to think hard about the right random graph model:
  1. Artzy-Randrup et al. showed that network motifs might be evaluated differently depending on the underlying random graph model.
  2. Colizza et al. showed that the rich-club effect also needs to be assessed against the appropriate random network model: observing only those nodes with degree at least k, the fraction of realized edges between these most connected nodes is called the rich-club coefficient. It was thought that a rich-club coefficient that increases with k is a sign of a rich-club effect in which the most connected nodes build a dense core. Colizza et al. show that even random graphs have this increasing rich-club coefficient, since high-degree nodes simply have a higher probability of being connected to anybody, including other high-degree nodes. Thus, it is important to compare with a random graph model that maintains the degree sequence.
  3. We have shown a similar effect in assessing the similarity of two nodes in a bipartite graph. Say you have a bipartite graph between customers and films, with an edge between a customer and a film if the customer stated she liked that film. If two films are now co-liked by 23 persons, is this statistically significant? Or if they are co-liked by 1,179 persons? By choosing the right random network model, it is easy to quantify the significance of this number. (The second pair of films was 'Pretty Woman' and 'Star Wars V', and you might have guessed that their number is significantly too low given their respective popularity.) See the figure for more examples.
  4. Even modularity, which already checks the observed number of edges within partitions against the expected number of these edges in a random graph, has some statistical flaws (or say, unintuitive properties), which were thankfully pointed out by Fortunato and Barthélemy.
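To make the rich-club coefficient from point 2 concrete, here is a minimal sketch on a toy graph (my own illustration with made-up data, not the code of Colizza et al.):

```python
# Minimal sketch of the rich-club coefficient phi(k): among all nodes with
# degree at least k, the fraction of possible edges that is actually present.
from itertools import combinations

def rich_club_coefficient(adj, k):
    rich = [v for v, nbrs in adj.items() if len(nbrs) >= k]
    n = len(rich)
    if n < 2:
        return 0.0
    possible = n * (n - 1) / 2
    realized = sum(1 for u, v in combinations(rich, 2) if v in adj[u])
    return realized / possible

# Toy graph (adjacency sets) with a fully connected "rich" core a, b, c.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "e"},
    "c": {"a", "b", "f"},
    "d": {"a"},
    "e": {"b"},
    "f": {"c"},
}
print(rich_club_coefficient(adj, 2))  # all 3 possible core edges exist -> 1.0
```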

Pairwise "similarity" of the most popular films in the Netflix data set (popular = rated by at least 30% of all customers). We have sorted the films such that they fall into natural groups, like action films in group 1 and chick flicks in group 5. For each pair, we counted the number of customers that liked both films (rating 4 or 5 out of 5). Blue fields indicate that a pair of films has been co-liked more often than expected, and red fields denote pairs of films that were co-liked significantly less often than expected. Of course, the expectation depends on the chosen random model. The classic random model (lower right) thinks that all popular films are liked together more often than expected. This includes the film pair "50 First Dates" and "The Patriot". If you also think that both films are simply wonderful, you can stop here. By using a different random model inspired by network analysis, a much more differentiated picture emerges (upper left). Here, it can clearly be seen that films from the same group are still co-liked more often than expected, while films from different groups (e.g., 1 and 5) are co-liked distinctly less often than expected.
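As a rough illustration of what "expected" means here: under the simplest baseline, where every customer likes every film independently with a probability given by the film's popularity, the expected number of co-likers has a one-line formula. (This is a deliberately simplified model, not the network-analysis model used for the figure, and the numbers are hypothetical.)

```python
# Simplest independence baseline (not the model used in the post): if film a
# is liked by d_a of m customers and film b by d_b, independent likes give an
# expected number of co-likers of m * (d_a/m) * (d_b/m) = d_a * d_b / m.
def expected_colikes(d_a, d_b, m):
    return d_a * d_b / m

# Hypothetical like counts for two popular films among 480,000 customers.
print(expected_colikes(200_000, 150_000, 480_000))  # 62500.0
```

Comparing an observed co-like count of, say, 23 or 1,179 against such an expectation (and its fluctuations under the chosen random model) is what turns the raw count into a statement about significance.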

I would like to know whether there is a similar trick as to thinking in frequencies rather than in probabilities to get the choice of the appropriate random graph model right. If you know one, let me know!

Feel free to use my slides for your own lecture on 'How to lie with statistics' (please give credit and note that all the images are under some GPL license, and so should be your modification of them). I also like to travel, so if I'm around and you're interested in this short course, I would love to come to your institution and give it. It takes three hours and is targeted at PhD students from non-mathematical fields in their first year.

Tuesday, August 9, 2011

Our Word-Morph Network

Do you remember the old word game in which you are asked to change the word 'cat' into 'dog' by exchanging one letter at a time, only using real English words? With cat and dog it is easier than it may seem: cat-cot-dot-dog. It is not always that easy: for ivy and dry the shortest path is: dry-dey-dee-dye-aye-ace-ice-icy-ivy. So, here, it is not possible to come directly 'closer' to the goal with each exchanged letter. The graph above shows all correct English three-letter words (extracted from the Oxford dictionary), where two words are connected by an edge if they differ in exactly one letter. So, this is the graph in which the word-morph game essentially takes place.
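The game graph and a shortest path can be sketched in a few lines (using a tiny made-up word list rather than the full Oxford dictionary we used):

```python
# Word-morph graph: connect two words of equal length if they differ in
# exactly one letter, then find a shortest path with breadth-first search.
from collections import deque

words = ["cat", "cot", "dot", "dog", "cog", "bat"]  # tiny illustrative list

def differ_by_one(a, b):
    return sum(x != y for x, y in zip(a, b)) == 1

graph = {w: [v for v in words if v != w and differ_by_one(w, v)] for w in words}

def shortest_path(start, goal):
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path("cat", "dog"))  # -> ['cat', 'cot', 'dot', 'dog']
```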

We asked ourselves how people learn to play this game: will they be slow at the beginning and get faster soon? Will they learn the shortest paths between all pairs or do something different? We found out that people learn 'landmark words' through which they tend to navigate. Thus, they only need to learn n paths in a network with n nodes, instead of learning n*n different paths. Of course, the paths that pass through a certain landmark word are a bit longer than the shortest paths (in this case by a factor of about 1.6). It thus seems that our human mind tries to optimize both path length and the effort required to learn the network. Check out our paper if you want to learn more about this research; we received a best paper award for it from CogSci 2011.

The graph was laid out in Gephi and coloured by an automatic clustering algorithm. An obvious pattern emerged, consisting of five central groups of words (dark blue, green, red, yellow, light blue). Can you guess what these five groups are?