Monday, November 7, 2011

Quotes from network analysis papers

I'm currently doing heavy literature research in network analysis, and from time to time I find quotes that deserve a wider audience.

Reciprocity

"We use the term reciprocity  to depict the degree of mutuality of a relationship. A re-
lationship with a high reciprocity is one where both are equally interested in keeping up
the relationship — a good example is the relationship of Romeo and Juliet in the famous
play with the same name by William Shakespeare, where the two lovers strive to share each
other’s company despite their families’ objections. On the other hand, in a relationship with
a low reciprocity one person is significantly more active in maintaining the relationship than
1the other. We judge reciprocity from actual communications taking place between people.

In Shakespeare’s time this would have meant counting the number of poems passed and
sonnets sung, but in our modern era it is easier to make use of the prevalence of electronic
communication." :-) [Lauri Kovanen, Jari Saramäki, Kimmo Kaski: "Reciprocity of mobile phone calls", ArXiv:1002.0763v1 [physics.soc-ph], p. 1-2]

Centrality Indices

Sabidussi about the introduction of new centrality indices:
"There is no doubt that an appeal to intuition must be made at some level, but it has to be made in a mathematically precise fashion." (Sabidussi, The centrality index of a graph, Psychometrika 31(4), 581-603, 1966)
So, how good is your mathematical precision on your intuition? ;-)

Sabidussi tried an axiomatic approach to centrality indices. He required, for example, that adding an edge to the most central vertex of a graph should always result in a graph in which that same vertex is still the most central. Although this sounds intuitive, most centrality indices do not pass this test (e.g., eccentricity). Nonetheless, Sabidussi concludes:

"One may, of course, blame this wholesale failure on our axiom system.There is little doubt, however, that the three indices [which he tested and which failed as well, among them a closeness-type of centrality] [...] would not survive even a more sophisticated system of axioms. In view of this, we strongly suggest that [these measures] be discarded and that centrality be measured by the trivial index [1/(n - degree of the node)] defined [above]. [It] is more easily calculated than any of the other indices, and, whatever its intuitive shortcomings, it has the decided advantage of satisfying a well-defined system of axioms." Sabidussi, 1966 (p. 20, see above)

Citation Failures

As everyone knows and as thankfully remarked by Goh, Kahng and Kim [Universal Behavior of Load Distribution in Scale-Free Networks, PRL 87(27), 278701, 2001], Mark Newman introduced the BFS: "... and measure the load along the shortest path using the modified version of the breath-first [sic!] search algorithm introduced by Newman [M. E. J. Newman, Phys. Rev. E 64, 016131 (2001); 64, 016132 (2001)]"

Excuses and Justifications


"Edge weights in networks have, with some exceptions [...], received relatively little attention in
the physics literature for the excellent reason that in any field one is well advised to look at the simple cases first (unweighted networks) before moving on to more complex ones (weighted networks). " M.E.J. Newman, "Analysis of weighted networks", ArXiv:cond-mat/0407503v1

Excuse for not asking all participants of the study the same questions:
"We only asked the last two informants this question because it didn't occur to us earlier."  
[Bernard, H. R.; Shelley, G. A. & Killworth, P.: "How much of a network does the GSS and RSW dredge up?", Social Networks, 1987, 9, 49-61]
In general I have the feeling that scientific articles were more personal in those days. This particular article starts with the following quote: "At my twenty-fifth high school reunion, last year, you would have been proud of me. The way I called those names up, with seldom a quick half glance at a tag ... their names came to me like the list of vowels, because I had learned them when I was fresh, back before I had met or heard of 375,000 other Americans. By the time anyone gets to be 43, if he has followed current events and been out of town a few times, two thirds of the names he hears sound vaguely, but only vaguely, familiar" (From "Not Exactly What I Had in Mind", Roy Blount Jr., The Atlantic Monthly Press, 1985)
I wish we would read more about the people behind the research. It would remind us all that science may be quantifiable but is still conducted by humans who err and make subjective decisions.


The following is a quote by two physicists about their engagement in economics, but the statement seems so general that it might apply to network analysis as well:
"Physicists are not, however, accustomed to waiting for a fully formed theory before reporting new results." Tobias Preis and H. Eugene Stanley: "Bubble trouble", Physics World May, p. 29-32, 2011.


Data quality: Protein-protein interaction networks

Mackay et al. wrote a paper about protein-interaction data and how unreliable they are. They tried to reproduce around 20 published protein-protein interaction pairs but were able to confirm less than half of them. Their article starts with the following sentences:

"When Othello concluded, on the basis of flimsy evidence, that Desdemona had indulged in inappropriate physical interactions, great tragedy ensued. We believe that many reported protein interactions are similarly based on insufficient data, and therefore risk skewing the literature."
(Mackay, J. P.; Sunde, M.; Lowry, J. A.; Crossley, M. & Matthews, J. M. Protein interactions: is seeing believing? TRENDS in Biochemical Sciences, 2008, 32, 530-531)


I really wonder whether we want to base any type of network analysis on data in which up to 50% of all edges are false positives.

Copy-pasting story lines

There are some story lines in network analysis that I have heard way too often; it seems they are just copy-pasted from one paper to the next. One of them is: "Until recently, network modeling often assumed the topology was either a random graph or a regular lattice." In this case the citation is from Goldberg and Roth, "Assessing experimentally derived interactions in a small world", PNAS 100(8), p. 4372-76, 2003, but it can be found in slight variations in hundreds of papers. How science builds narratives that convey its beliefs in a model is a very interesting topic in itself, but I think it is important to stop from time to time and re-think a story line. Often, the narrative becomes oversimplified by little "mutations" along its copy-and-paste evolution, which essentially hinders science more than it helps it.

Coming back to the example: it is unreasonable to think that any mature scientist would actually assume that a real-world network is either a random graph or a regular lattice. We have modeled it as a random graph or a regular lattice in the hope that these models capture its essential properties, or simply because a random graph model or a lattice comes in handy when trying to calculate things. The origin of these two models, especially in complex network analysis, is that atomic interactions in crystal lattices could be analyzed and solved exactly in two settings: a lattice structure, which comes naturally for crystals, or a random graph, which is not a natural choice but one in which some insight into the model can be gained. Thus, Watts and Strogatz, who both come from statistical physics, where these interactions are an important research question, focused on these two models: they were prevalent in statistical physics and well understood there. It is thus reasonable to transfer these models to real-world interactions between things other than atoms, just to see how well they do. This has nothing to do with actually assuming that the models capture the real-world interactions in any meaningful way.

Mathematical modeling (in general and for network analysis)

In his article "Sampling in social networks", Rothenberg cites Rapoport as follows: "Mathematical modeling is a vehicle for absolutely rigorous reasoning and therein lies its advantage. A disadvantage of mathematical modeling is that it necessitates a drastic paring down of the details of the phenomena modeled...these simplifications...can impair or altogether destroy the model's pragmatic relevance."


To be continued...

Thursday, August 18, 2011

How to lie with statistics

I very much like to give a short course on 'how to lie with statistics' (check out my slides). My experience is that most people feel very uncomfortable interpreting statistics and applying statistical methods to their data. The problem is that most young scientists seem to think they are the only ones who do not understand statistics. It rather seems to be the case that humans in general are not very good at thinking in probabilities. Gigerenzer showed that the following question is very hard to solve correctly, even for medical doctors:

If a normal student donates some blood and an HIV test turns out positive, how likely is it that she is really infected? To answer that you need to know the sensitivity and specificity of the test, i.e., the probability that an infected person will be detected by the test and the probability that a non-infected person will have a negative test result. Both probabilities are very high, around 99.8%. Still, the probability that a student with a positive first test is really infected is only 40/2040 = 1/51. Astonished?

80% of the medical doctors he interviewed gave the wrong numbers, another 7% got close but presumably by some obvious error, and only the remaining ones gave the correct answer.
But Gigerenzer also showed very nicely how to remedy the problem: forget about probabilities and think in natural frequencies! Then you will notice that you need one more piece of information: how many people in the general population are infected per year without knowing it? According to the German Robert Koch Institute, this incidence rate of new HIV infections is about 3,000 per year in the whole population. So, if 1,000,000 people (out of 80 million Germans) donate blood, we can expect around 40 of them to be infected with HIV. The remaining roughly 1,000,000 donors are not infected, but 2 out of every 1,000 of them will nonetheless test positive. Thus, in total there will be about 2,000 + 40 positive tests, of which only 40 really point to infected donors.
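To make the arithmetic explicit, here is a small R sketch (my own, not part of the course material) that reproduces the natural-frequency reasoning and checks it against Bayes' theorem; the numbers are the rounded ones from the text:

sensitivity <- 0.998     # P(positive test | infected)
specificity <- 0.998     # P(negative test | not infected)
n <- 1e6                 # imagine 1,000,000 blood donors
infected <- 40           # undetected infections expected among them

# natural frequencies
true_pos  <- infected * sensitivity              # about 40
false_pos <- (n - infected) * (1 - specificity)  # about 2,000
true_pos / (true_pos + false_pos)                # about 0.02, i.e., roughly 1 in 51

# the same result directly via Bayes' theorem
p_inf <- infected / n
p_pos <- sensitivity * p_inf + (1 - specificity) * (1 - p_inf)
sensitivity * p_inf / p_pos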


In network analysis, too, there have been some articles on statistics and its pitfalls. Most of them warned about wrong random models, or at least warned that we need to think hard about the right random graph model:
  1. Artzy-Randrup et al. showed that network motifs might be evaluated differently depending on the underlying random graph model.
  2. Colizza et al. showed that the rich-club effect, too, needs to be assessed against an appropriate random network model: looking only at those nodes with degree at least k, the fraction of realized edges between these most connected nodes is called the rich-club coefficient. A rich-club coefficient that increases with k was thought to be a sign of a rich-club effect in which the most connected nodes build a dense core. Colizza et al. showed that even random graphs have this increasing rich-club coefficient, since high-degree nodes simply have a higher probability of being connected to anybody, including other high-degree nodes. Thus, it is important to compare against a random graph model that maintains the degree sequence (see the small igraph sketch right after this list).
  3. We have shown a similar effect in assessing the similarity of two nodes in a bipartite graph. Say you have a bipartite graph between customers and films, with an edge between a customer and a film if the customer stated she liked that film. If two films are now co-liked by 23 persons, is this statistically significant? Or if they are co-liked by 1,179 persons? By choosing the right random network model, it is easy to quantify the significance of this number. (The second pair of films was 'Pretty Woman' and 'Star Wars V', and you might have guessed that their number is significantly too low given their respective popularity.) See the figure for more examples.
  4. Even modularity, which already checks the observed number of edges within partitions against the expected number of such edges, has some statistical flaws (or say, unintuitive properties), which were thankfully pointed out by Fortunato and Barthélemy.
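To illustrate point 2, here is a small igraph sketch (my own, not Colizza et al.'s code) that compares the observed rich-club coefficient with the values obtained from random graphs with the same degree sequence; the example graph and the number of random samples are arbitrary placeholders:

library(igraph)

# rich-club coefficient: fraction of realized edges among nodes with degree >= k
richClub <- function(g, k) {
  rich <- which(degree(g) >= k)
  if (length(rich) < 2) return(NA)
  el <- get.edgelist(g, names = FALSE)
  sum(el[, 1] %in% rich & el[, 2] %in% rich) / choose(length(rich), 2)
}

set.seed(1)
g <- barabasi.game(1000, m = 3, directed = FALSE)   # placeholder graph
k <- 10
observed <- richClub(g, k)

# null model: 100 random graphs with exactly the same degree sequence
null <- replicate(100, richClub(degree.sequence.game(degree(g), method = "vl"), k))
observed / mean(null)   # a ratio clearly above 1 hints at a genuine rich-club effect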

Pairwise "similarity" of the most popular films in the Netflix data set (popular = rated by at least 30% of all customers). We have sorted the films such they fall in natural groups like action films in group 1 and chick flicks in group 5. For each pair we counted the number of customers that liked both films (rating 4 or 5 out of 5). Blue fields say that the pair of film has been co-liked more often than expected and red fields denote pairs of films that were significantly less co-liked than expected. Of course, expectation depends on the chosen random model. The classic random model (lower right) thinks that all popular films are more often liked together than expected. This includes the film pair: "50 first dates" and "The Patriot". If you also think that both films are simply wonderful, you can stop here. By using a different random model inspired from network analysis, a much more differentiated picture emerges (upper left). Here, it can be clearly seen that films from the same group are still more often co-liked than expected while different groups (e.g., 1 and 5) are distinctly less co-liked than expected.

I would like to know whether there is a trick, similar to thinking in frequencies rather than probabilities, that helps to get the choice of the appropriate random graph model right. If you know one, let me know!

Feel free to use my slides for your own lecture on 'How to lie with statistics' (please give credit and note that all the images are under some GPL, so your modifications should be as well). I also like to travel, so if I'm around and you're interested in this short course, I would love to come to your institution and give it. It takes three hours and is targeted at PhD students from non-mathematical fields in their first year.

Tuesday, August 9, 2011

Our Word-Morph Network

Do you remember the old word game in which you are asked to change the word 'cat' into 'dog' by exchanging one letter at a time, only using real English words? With cat and dog it is easier than it may seem: cat-cot-dot-dog. It is not always that easy: for ivy and dry the shortest path is dry-dey-dee-dye-aye-ace-ice-icy-ivy. So, here, it is not possible to come directly 'closer' to the goal with each exchanged letter. The graph above shows all correct English three-letter words (extracted from the Oxford dictionary), where two words are connected by an edge if they differ in exactly one letter. This is the graph in which the word-morph game essentially takes place.
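If you want to rebuild such a graph yourself, here is a rough igraph sketch; the word vector below is just a tiny placeholder, in practice you would read in a complete list of three-letter words:

library(igraph)

# placeholder list; replace by, e.g., readLines("three_letter_words.txt")
words <- c("cat", "cot", "dot", "dog", "dry", "dey", "dee", "dye",
           "aye", "ace", "ice", "icy", "ivy")
words <- words[nchar(words) == 3]

# two words are adjacent if they differ in exactly one letter
differByOne <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]]) == 1
}

pairs <- t(combn(words, 2))
keep  <- apply(pairs, 1, function(p) differByOne(p[1], p[2]))
wordGraph <- graph.edgelist(pairs[keep, , drop = FALSE], directed = FALSE)

# the word-morph game is then a path search in this graph
# (in recent igraph versions the result of get.shortest.paths has a $vpath component)
get.shortest.paths(wordGraph, "cat", "dog")$vpath[[1]]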

We asked ourselves how people learn to play this game: will they be slow at the beginning and get faster soon? Will they learn the shortest paths between all pairs or do something different? We found out that people learn 'landmark words' through which they tend to navigate. Thus, they only need to learn n paths in a network with n nodes (one from each word to a landmark), instead of learning n*n different paths (one for each pair of words). Of course, the paths that navigate through a certain landmark word are a bit longer than the shortest paths (in this case by a factor of about 1.6). It thus seems that our human mind tries to optimize two things at once: the path length, but also the effort needed to learn about the network. Check out our paper if you want to learn more about this research; we got a best paper award for it from CogSci 2011.

The graph was laid out in Gephi and coloured by an automatic clustering algorithm. An obvious pattern emerged, consisting of five central groups of words (dark blue, green, red, yellow, light blue). Can you guess what these five groups are?

Tuesday, July 26, 2011

Egonet visualization in igraph

Just for fun I thought I should implement the egonet visualization from yesterday in the igraph package as well (yesterday I used the network package). Merely two hours later I came up with the right recipe... :-) Basically, the fun was motivated by the not-so-overwhelming visualizations of my last blog post. So, I had the idea to export the resulting ego-net subgraph into some Gephi-readable format. In short, we want to go from the first figure to the second figure. Yes, I'm sure it is really the same graph!!




I could not find any helpful write/export function in the network package, but there is a flexible write.graph function in the igraph package. Both packages are quite similar, thus a quick transformation would do the trick. So, let's look at the necessary transformations:

# read the edge list and create a directed igraph object from the data frame
tableEmail <- read.table("email_Arenas.txt")
emailNW <- graph.data.frame(tableEmail, directed=T)
# draw 10 random vertex indices; igraph indexes vertices from 0 to n-1 here,
# so the range is 0:(vcount(emailNW)-1)
randomSample <- sample(0:(vcount(emailNW)-1), 10, replace=FALSE)
# collect the seed vertices and their direct neighbours
neighs <- vector()
for(x in randomSample){
 neighs <- c(neighs, neighborhood(emailNW, 1, x, mode="all")[[1]])
}
# induced subgraph on the seeds and their neighbourhoods
subgraph <- subgraph(emailNW, neighs)


The replacements are:
  1. graph.data.frame instead of network. It creates the igraph object out of the data frame.
  2. random sample from the interval [0:n-1] where n is the number of nodes, explained later.
  3. neighborhood instead of get.neighborhood. It makes life a bit easier since it naturally includes all nodes at distance at most the given order, in this case 1. Thus, we do not need to explicitly include x itself. Careful, the result is a list, so make sure to only append the first entry of the list to the vector neighs.
  4. subgraph instead of get.inducedSubgraph.
The plot at this point looks similar to the other one: all vertices are red. Now, the fun begins! How can we color the random seed nodes? In essence, what we did in the network package was to prepare a vector of colors in which we assigned a different color to the vertices from the random sample (as identified by their index). We then restricted this vector to those indices which are still present in the subgraph, as identified by the function network.vertex.names(subgraph). It can thus be seen that in the network package the induced subgraph keeps the old vertex IDs as vertex names:

color <- rep(2, times=network.size(emailNW))
color[randomSample] = 3
plot(subgraph, vertex.col=color[network.vertex.names(subgraph)])
However, igraph does not make it as simple for us. First of all, it re-assigns vertex IDs in the subgraph to make them consecutive. Second, these indices run from 0 to (number of nodes - 1). This is already the case in the first igraph object that we created, namely emailNW. This leads to the first surprising behavior. Let's look at the random sample:

> randomSample
> [1]  670   97  352  346  465   53   37 1092  726   74

Let's look at the vertices in the resulting induced subgraph with the function V(subgraph):
 
> V(subgraph)
 [1] "2"    "3"    "5"    "7"    "18"   "19"   "21"   "22"   "23"   "27"  
 [11] "31"   "38"   "40"   "41"   "45"   "49"   "51"   "54"   "69"   "72"  
 [21] "74"   "75"   "76"   "87"   "98"   "112"  "124"  "143"  "148"  "152" 
 [31] "183"  "185"  "187"  "189"  "191"  "195"  "231"  "233"  "237"  "241" 
 [41] "254"  "267"  "268"  "270"  "275"  "280"  "290"  "314"  "316"  "329" 
 [51] "330"  "331"  "333"  "344"  "345"  "346"  "347"  "348"  "349"  "350" 
 [61] "351"  "352"  "353"  "354"  "355"  "356"  "362"  "378"  "392"  "396" 
 [71] "454"  "462"  "463"  "464"  "465"  "466"  "467"  "468"  "501"  "523" 
 [81] "538"  "549"  "556"  "557"  "558"  "559"  "560"  "561"  "568"  "578" 
 [91] "590"  "598"  "627"  "635"  "636"  "638"  "671"  "680"  "711"  "727" 
[101] "743"  "746"  "748"  "765"  "778"  "836"  "841"  "940"  "941"  "942" 
[111] "943"  "944"  "945"  "946"  "954"  "1030" "1031" "1092" "1093"

There is not a single vertex from the random sample in this set, but, suspiciously, for each of them there is a vertex with an ID increased by one. What happens is that igraph uses the random sample as indices into the vertices, which are themselves labeled from 1 to n (the number of nodes in the graph). I.e., if we address the first ten nodes of emailNW by [0:9], we will get the vertices labeled 1 to 10:

> V(emailNW)[0:9]
 Vertex sequence:
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

But (you knew there would be a but, didn't you?) all other attributes assigned to the node set of the graph are indexed from 1 to n. For example, the character vector with the names (labels) of the vertices is indexed from 1 to n:
 
> V(emailNW)[0]
 Vertex sequence:
 [1] "1"
> V(emailNW)$name[0]
 character(0)
> V(emailNW)$name[1:10]
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
> V(emailNW)[0:9]
 Vertex sequence:
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Beautiful. With this it is now a piece of cake to do the coloring in igraph as well:
 
# color 2 for everybody, color 3 at the positions given by the 0-based random sample
color <- rep(2, times=vcount(emailNW))
color[randomSample] <- 3
# the vertex names are the old 1-based labels, so subtract 1 to index into 'color'
V(subgraph)$color <- color[as.numeric(V(subgraph)$name)-1]
V(subgraph)$layout <- layout.fruchterman.reingold(subgraph)
plot(subgraph)

Now, the figure will have 10 green nodes as planned.


Yeah, I know, beautiful layout. That is WHY we need to export it to Gephi and beautify it there.

Btw, the layout assignment will throw two warnings, but please don't ask me why. I hope I will not have to investigate that as well... ;-)

You might want to check that the right nodes got the right color:
 
V(subgraph)$name[V(subgraph)$color==3]
 V(emailNW)[randomSample]
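The export step itself is not shown above; a minimal sketch (the file name is arbitrary) would be igraph's write.graph function with the GraphML format, which Gephi can read and which keeps the vertex attributes we just assigned:

# write the coloured ego-net subgraph to GraphML for Gephi
write.graph(subgraph, file = "egonet_subgraph.graphml", format = "graphml")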

Gephi's version
So, here is the Gephi layout. Of course, a hairball is a hairball - but: did you see the isolated component beforehand?

Monday, July 25, 2011

Visualizing and Coloring of Induced Subgraphs

If the network is too large to visualize as a whole, it might be a good idea to choose some nodes at random and to visualize their direct surroundings, also called their ego networks. The ego network comprises the direct neighbors of a node and the relationships between these neighbors. In this example we will combine all edges between 10 randomly chosen nodes and their direct neighbors. The data set we are working on is freely available from Alex Arenas's homepage. We use his data on the email contact network of the University Rovira i Virgili and we will assume that it is stored as 'email_Arenas.txt'. (For all readers who are fluent in LaTeX, here is the corresponding Sweave file which will create all figures in eps/pdf/png format. It assumes you have set the working directory to where it and the data are located.)


> library(network)
> setwd("path-to-wherever-you-stored-the-data")
> tableEmail <- read.table("email_Arenas.txt") 

# create a network from the data frame
> emailNW <- network(tableEmail, directed = T)

# choose 10 random vertex IDs without replacement
> randomSample <- sample(1:network.size(emailNW), 10, replace = FALSE) 

# show the randomly drawn vertex IDs
> randomSample
[1]  843  548  691  871  524  858  290 1059  485  593

#initialize a new vector
> neighs <- vector() 

#for all vertex IDs in the random sample
> for (x in randomSample) {

# add themselves and their direct neighborhood to the vector 'neighs'
>  neighs <- c(neighs, x, get.neighborhood(emailNW, x, type = "combined")) 
> } 

#create the induced subgraph
> igraph <- get.inducedSubgraph(emailNW, neighs) 
Now, we would like to see this random sample from the graph:
> plot(igraph)

An induced subgraph based on 10 randomly drawn seed nodes and their direct neighborhoods.



The induced subgraph gives an overall impression of how dense the local neighborhoods and the connections between the 10 randomly drawn neighborhoods are. It would, however, be helpful to identify the 10 seeds. This can be done by coloring them accordingly. This was my first try:
#make a vector of size n (=#number of nodes), assigning color 2 as default
> color <- rep(2, times = network.size(emailNW))
#for the seed nodes, assign color number 3 
> color[network.vertex.names(emailNW) %in% randomSample] = 3
#plot the graph and assign the color vector
> plot(igraph, vertex.col = color)

Interestingly, this does not give the wanted result; it seems as if there were no seed vertices at all:


A test plot in which all seed nodes should have been colored in green. Did not work, obviously.

If you run the same code again and again, you will sometimes see one, two, or even more green vertices. So, what happens (or seems to happen) is the following: the induced subgraph creates a new graph with obviously fewer vertices than the original graph. The vertex names themselves are maintained, and their order is maintained as well; you can check that by typing network.vertex.names(igraph). But only the first n entries of vertex.col are used as the assignment of colors. Thus, we need to reduce the color vector to those entries which are contained in igraph:

> plot(igraph ,vertex.col=color[network.vertex.names(igraph)])

This will finally produce the graph with all 10 seed nodes colored in green:


The final result with all 10 seed nodes colored in green.

Saturday, July 23, 2011

Importing Data into the network Package in R (II)

Okay, let's proceed with the last example to find another interesting behavior of the network and igraph packages in R. Again, we will assume you downloaded the autonomous system network as compiled by Jure Leskovec, named it "edges.txt" and saved it into your favourite directory "favDir". We will now import it:

> setwd("path-to-favDir")
> library(network)
> myEdges <- read.table("edges.txt", header=T)
We now look at the lines 500 to 600 in the table and create a network from these lines:
> myEdges[500:600,]
> network <- network(myEdges[500:600,])
In this part of the table, there are exactly 101 edges, all with the same source node with ID 1 and 101 other nodes. Thus, we would expect that the resulting network has 102 nodes. So, let's check:
> network.size(network)
> 568
So, essentially what happens is that the network()-function creates a node for each ID from 1 up to the maximum ID of 568. This is a side effect (bug or feature?) of the IDs being numeric, i.e., of type int in the data.frame myEdges. If you instead tell R to read in the nodes' IDs as characters, the resulting network will have the expected 102 nodes:
> myEdges <- read.table("edges.txt", header=T, colClasses=c("character", "character"))
> network2 <- network(myEdges[500:600,])
> network.size(network2)
> 102

So, if you ever experience long transformation times from data.frame to network, check whether your IDs have a numeric type and whether you have IDs which are much higher than the number of nodes in the network.

Importing Data into the network Package in R (I)

Importing network data into R seems to be so simple. We will assume that you have an edge list in a data file called edges.txt in your favourite directory favDir.

It could, e.g., be a data set like the connections of autonomous systems in January 2000, as compiled by Jure Leskovec of which we show the first lines:

# Undirected graph: as20000102.txt
# Autonomous Systems (from the BGP logs) from January 02 2000
# Nodes: 6474 Edges: 13233
 FromNodeId ToNodeId
0 1
0 2
0 3
0 4
0 5
0 6

The first three lines are comments, as indicated by the hash symbol. We have removed the hash symbol from the fourth line as this contains the names of the columns. This data can easily be read into R:

> library(network)
> library(sna)
> setwd("path-to-favDir")

> myEdges <- read.table("edges.txt")
> network <- network(myEdges)
But something is strange, if we look at the number of nodes in the network:
> network.size(network)
> 6476
But this does not match with Jure's statistics: he says that the network has only 6474 nodes! So, let's look at the nodes' names in the network:
> tail(network.vertex.names(network))
> "996"        "997"        "998"        "999"        "FromNodeId" "ToNodeId"  
With tail we only see the last six entries of all vertex names. But we can see what happened: the header information "FromNodeId" and "ToNodeId" was interpreted as an edge between two nodes named "FromNodeId" and "ToNodeId". This can be remedied by adding header=T to the read.table function:
> myEdges2 <- read.table("edges.txt", header=T)
> network2 <- network(myEdges2)
But now we get a very strange error message which is not particularly helpful:
Error in if (matrix.type == "edgelist") { : 
  missing value where TRUE/FALSE needed
By reducing the data set a bit, you would find that nothing happens as long as you do not have a node with ID 0. But why was that not a problem beforehand? R makes some implicit assumptions when reading in data files. In the first go, the first real line of the file was interpreted as an edge between two nodes with names "FromNodeId" and "ToNodeId". Since these 'names' consisted of letters, R imported all subsequent IDs as so-called "factors", i.e., observations of different categories, where the names are interpreted as the possible categories. This can be seen by looking at the structure of myEdges:
> str(myEdges)
> 'data.frame':   13896 obs. of  2 variables:
> $ V1: Factor w/ 2257 levels "0","1","10","100",..: 2257 1 1 1 1 1 1 1 1 1 ...
> $ V2: Factor w/ 6431 levels "1","10","100",..: 6431 1 1109 2217 3318 4419 5524 6103 6214 6322 ...
If this is transformed into a network, the "category names" are imported as strings of letters. But the new import explicitly said that the first line contains header information, so the first 'real' line is recognized as numbers by R. Let's look at the structure of myEdges2:
> str(myEdges2)
> 'data.frame':   13895 obs. of  2 variables:
> $ FromNodeId: int  0 0 0 0 0 0 0 0 0 0 ...
> $ ToNodeId  : int  1 2 3 4 5 6 7 8 9 10 ...
So, essentially, R was clever enough to recognize the right type of the IDs (namely int). If we now try to transform this data.frame into a network, the network package kicks in: it just doesn't like a numerical ID equal to 0. There are now two remedies:
> myEdges3 <- read.table("edges.txt", header=T)
> myEdges3$FromNodeId <- myEdges3$FromNodeId+1
> myEdges3$ToNodeId <- myEdges3$ToNodeId+1
> network3 <- network(myEdges3)
> network.size(network3)
With the second and third lines we increased all IDs by one and re-assigned these values to the two columns. Now the network transformation proceeds without error, and the number of nodes is correct at 6474. The other approach is to tell R explicitly to read in the IDs as "names", i.e., strings of letters:
> myEdges4 <- read.table("edges.txt", header=T, colClasses=c("character", "character"))
> network4 <- network(myEdges4)
> network.size(network4)

Now we have enforced the import of the nodes' IDs as characters, and as a name, "0" is perfectly fine for the network package.

Both approaches have their problems: in the first, you change the IDs, and if you have additional node attributes identified by the ID, you need to take care to make them match. Similarly, if the IDs are imported as characters, you might have more trouble matching (numeric) IDs with their character versions.
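A hedged sketch of what that bookkeeping could look like; the attribute file nodeAttributes.txt and its columns NodeId and attr are hypothetical:

# hypothetical node attribute table, keyed by the original (0-based) IDs
nodeAttr <- read.table("nodeAttributes.txt", header = TRUE)  # columns: NodeId, attr

# approach 1 (shifted IDs): shift the key of the attribute table by +1 as well
attr3 <- nodeAttr$attr[match(as.numeric(network.vertex.names(network3)),
                             nodeAttr$NodeId + 1)]

# approach 2 (character IDs): compare character with character
attr4 <- nodeAttr$attr[match(network.vertex.names(network4),
                             as.character(nodeAttr$NodeId))]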

Saturday, June 18, 2011

Arts, Humanities, and Complex Networks at NETSCI 2011 – Fueling my scientific batteries



Due to the generosity of various sponsors, the AHCN workshop did not have a registration fee – but since the location was limited in size, participants were asked to register for a cost-free ticket. On the evening before the workshop I found out that I was 83rd on the waiting list for such a ticket! Nonetheless, I just went there the next morning and was lucky that some of the registered participants had not shown up. But boy, they really missed something!

I had very high expectations for this workshop since a similar one at ECCS'10 (also organized by Maximilian Schich) had already been splendid – but all my expectations were met and even exceeded. The workshop took place in the very modern building of the Ludwig Múzeum, directly located at the Danube. Everything was splendidly and most generously organized: no fee, but an extraordinarily good caterer with plenty of coffee and pogácsa (small Hungarian bread), and the most inspiring talks I heard at the whole conference. Do I sound enthusiastic? I definitely am! The broad range of topics and the enthusiasm of the many people who had just started to explore the possibilities of network analysis for their field were simply energizing. I'll give you an overview of the topics and some of the (subjectively) most interesting points.

The first keynote was given by Marek Claassen, on how to automatically score artists. Of course, there is the obvious way of simply summing up the money paid for their works, but Claassen proposed a more detailed model in which fame and recognition are quantified. The main idea is that an exhibition, especially a solo exhibition at a renowned institution, gives more attention to the artist, and that this attention is like a currency. Of course, such a talk raises a lot of questions and even emotions, but it was an interesting start into a workshop which focuses exactly on this frontier: are we able to explore human culture by quantifying the statistical traces in our digital, virtual world?

Next, Tom Brughmans showed his enthusiasm for using network analysis in archaeology: he showed how he links various sites by the types of artifacts found at these sites. This research is still young and it is not yet clear what kinds of questions can be answered with it – which makes it all the more interesting for network analysts.

An absolute highlight was the keynote talk by Jean-Baptiste Michel about Culturomics [1][2]. He and his co-authors wondered: within 10 years we can either read a few books very closely, or we can read 5 million books very superficially – let's find out how much we can do with the latter! So they started to use 5 million digitized books from Google Books, which amounts to approximately 4% of all books ever published. In these books they looked at the probability that an irregular verb becomes regularized, the time it takes until a new invention like the car, the telephone, or the washing machine makes it into a book, or which profession is the most noticed and the one most persistently found in books. The result of the latter is clear: if you're looking for fame you need to become a politician, but never a mathematician! I was very much astonished how far their approach took them – an amazing talk. I'm sure we will hear more from him and his co-authors.

Even after this really great talk, the next invited speaker, Nathalie Henry Riche, had absolutely no problem switching our focus to a totally different but equally fascinating topic: the visualization of large complex networks. She has developed several software tools which all look very interesting: in her PhD she tried out various ways to display graphs as adjacency matrices or to integrate adjacency matrices (e.g., of dense groups) with a 'normal' 2D embedding of the graph. Within the adjacency-matrix approach she, e.g., had the idea of folding parts of the matrix away so that you can compare distant rows or columns with each other without losing track of what belongs where. Quite a cool idea! Her homepage does not directly link to downloads of her software, but she mentioned that on request there might be a way to get it.

The next talk that really grabbed my attention was by Robin Wilkins, who presented her and her co-authors' work on how the brain reacts to different types of music. For every person in the experiment, the experimenters knew which song they loved most and which type of music they liked in general. The scientists put together a long recording of all songs plus the favorite song of the respective test person. Looking at the working brain, it became clear that the brain reacts totally differently to songs it likes than to those it does not understand (music from a different culture) or those that it does not like. In particular, Wilkins et al. looked at how well the different parts of the brain, e.g., those doing the hearing or those concerned with emotions and memory, were synchronized during the different songs.

In summary: all of the talks were full of enthusiasm for bringing together the arts, the humanities, and network analysis. It was just very refreshing to feel this enthusiasm, and next time I'll make sure to register for Max's great workshops at the earliest opportunity. So, please keep up the good work, Max!

Thursday, June 9, 2011

Circuits of profits – a satellite event of NetSci 2011

Maven7 managed to create an open atmosphere, attracting a good mixture of business managers, business consultants, and scientists in the well-equipped building of the Central European University. The density of ties and suits was much higher than at a normal physics or computer science conference – which also guaranteed a better coffee and cake selection: the caterer was from the famous Gundel restaurant.

The keynote speech was given by Albert-László Barabási, who summarized the aspects of the evolution of network analysis that might be most influential for the economy: he took us on a time travel, beaming us through the preferential attachment model and its close cousin, the fitness model. He pointed out a clear connection to markets: it is not always the first mover into a market that becomes big; if you have a genuinely new function to provide, you can still win the market even if you come later. It is known that the fitness model can lead to a Bose-Einstein condensate if one of the nodes has an extremely high fitness, i.e., even a latecomer can grab a big share of all subsequent customers: in the evolution of such a network this node will keep a constant percentage of all edges, while in the normal preferential attachment model the node of maximal degree will rather have a shrinking relative share of all edges. Barabási suggested that the operating system market might be of this fitness type, since Microsoft has had a market share of about 85% for the last two decades.

I'm not sure whether this model holds, since this network is fundamentally different from the preferential attachment model: most nodes in the operating system network are normal users, not companies producing and selling operating systems. Thus, a new node is much more likely to be a user, and she can only choose from those producers that are already in the market. It would certainly be interesting to test whether the model applies to this slightly different situation. He also briefly mentioned the link community algorithm by Sune Lehmann which, in my personal view, is one of the best ideas about clustering in networks in years. The link points to the enormous supplemental material presented along with the article.


The next invited speaker was Hamid Benbrahim, who made a very interesting point: he introduced the "sigma-delta disconnect", namely that we either know much about aggregates (sums, represented by the sigma) or about the differences between things (represented by the delta). In essence, he argued that there is a gap in our understanding of the macro and the micro level of our complex financial markets. He also pointed out that, due to new technology, markets are now much more synchronized than 10-20 years ago, because any small leverage between markets is easily identified and can be exploited within milliseconds – virtually at the speed of light. He also showed a case in which two software agents created a small crash in the price of a stock because the first one wanted to sell a huge chunk of it. To avoid depressing the price, the agent was of course smart enough to offer it in small portions – however, the second agent was programmed to detect behavior like this, guessed that a large chunk would be sold, and bet against the price. This led to the predicted decrease of the price until the process was finally stopped. After a five-second break, other agents realized that the stock was undervalued and started buying it, and after around an hour the price was back to normal. This shows how the modern trading system can fall into positive feedback loops that push the system out of its equilibrium position.


Alberto Calero made the interesting remark that the Gartner report on Ten Technology Trends 2011 contained three key technologies related to network analysis: social communication and collaboration, social analytics, and next-generation analytics. He also shared his experience on how to convince fellow CEOs that network analysis can help in business decisions: he did not really disclose the details, but it became clear that network analysis was a prominent part in advertising the new services made available by mobile phones in the 90s, like SMS. He also reported on a new campaign in which customers are rewarded in groups, and he emphasized that attracting groups of people to a mobile phone provider might become much more important than attracting single customers. This was of course also a big topic in the later talks, which reported on different churn models and strategies to prevent churn.


Finally, Valdis Krebs spoke on how to communicate network analysis results. He reported from his experience with various case studies and emphasized that we need new measures and new ways to report them. One of his customers once showed him a simple chart with a curve: the curve goes up – everything is fine. The curve stalls – watch out! The curve goes down – not good at all. The curve goes up again – you're back in business. He asked whether it was possible to turn network analytic results into something as simple as that. A first step in this direction seems to be to replace all occurrences of 'social network analysis' by 'organizational network analysis'. Valdis' answer to that request is, e.g., not to show networks, but to show a Venn diagram representation of an embedded and clustered network instead, since that is more digestible. He also emphasized developing a standardized procedure, using only a few standardized programs, and allowing for the transfer of best practices. In general, his advice is: keep it simple.

This last point was basically the outcome of many of the later talks as well: do not make a research project out of every case study, but communicate your results in a simple and short manner. In the round table discussion (make that 'round chair discussion'), the invited speakers agreed that we need to follow a schematized, standardized way of producing network analysis reports on economic networks – be they communication, trust, or inspiration networks. Maven7 is helping with that by promoting their first software tool, aimed at consultants who want to add network analysis to their repertoire. A second main point was that maybe we focus too much on single nodes and their positions in a network. Helping a company should not boil down to "try to connect these three people more" but rather to creating an atmosphere in which positive network structures can emerge.