## Tuesday, July 26, 2011

### Egonet visualization in igraph

Just for fun I thought I should implement the egonet visualization from yesterday in the igraph package as well (yesterday I used the network package). Merely 2 hours later I came up with the right recipe... :-) Basically, the fun was motivated by the not so overwhelming visualizations of my last blog. So, I had the idea to export the resulting ego-net-subgraph into some Gephi-readable format.In short, we want to go from the first figure  to the second figure. Yes, I'm sure it is really the same graph!!

I could not find any helpful write/export-function in the network package, but  there is a flexible write.graph-function in the igraph package. Both packages are quite similar, thus a quick transformation would do the trick. So, let's look at the necessary transformations:

```tableEmail <- read.table("email_Arenas.txt")
emailNW <- graph.data.frame(tableEmail, directed=T)
randomSample <- sample(0:vcount(emailNW)-1, 10, replace=FALSE)
neighs <- vector()
for(x in randomSample){
neighs <- c(neighs,neighborhood(emailNW, 1, x, mode="all")[[1]])
}
subgraph <- subgraph(emailNW, neighs)
```

The replacements are:
1. graph.data.frame instead of network. It creates the igraph object out of the data frame.
2. random sample from the interval [0:n-1] where n is the number of nodes, explained later.
3. neighborhood instead of get.neighborhood. It makes life a bit easier since it naturally includes all nodes in distance smaller than the given order, in this case 1. Thus, we do not need to explicitly include x itself. Careful, the result is a list, so make sure to only append the first entry of the list to the vector neighs
The plot at this time point looks similar to the other one, all vertices are red. Now, the fun begins! How can we color the random seed nodes? In essence, what we did in the network package was to prepare a list of colors where we assigned a different color to the vertices from the random sample (as identified by their index). We then restricted this vector to those indices which are still present in the subgraph, as identified by the function network.vertex.names(subgraph). It can thus be seen that in the network package the induced subgraph keeps the old vertex IDs as vertex names:

```color <- rep(2, times=network.size(emailNW))
color[randomSample] = 3
plot(subgraph, vertex.col=color[network.vertex.names(subgraph)])```
However, igraph does not make it as simple for us. First of all, it re-assigns vertex IDs in the subgraph to make them subsequent. Second, these indices run from 0 to (number of nodes -1). This is already the case in the first igraph-object that we created, namely emailNW. This leads to the first surprising behavior. Let's look at the random sample:

```> randomSample
> [1]  670   97  352  346  465   53   37 1092  726   74
```

Let's look at the vertices in the resulting induced subgraph with the function V(subgraph):
` `
```> V(subgraph)
[1] "2"    "3"    "5"    "7"    "18"   "19"   "21"   "22"   "23"   "27"
[11] "31"   "38"   "40"   "41"   "45"   "49"   "51"   "54"   "69"   "72"
[21] "74"   "75"   "76"   "87"   "98"   "112"  "124"  "143"  "148"  "152"
[31] "183"  "185"  "187"  "189"  "191"  "195"  "231"  "233"  "237"  "241"
[41] "254"  "267"  "268"  "270"  "275"  "280"  "290"  "314"  "316"  "329"
[51] "330"  "331"  "333"  "344"  "345"  "346"  "347"  "348"  "349"  "350"
[61] "351"  "352"  "353"  "354"  "355"  "356"  "362"  "378"  "392"  "396"
[71] "454"  "462"  "463"  "464"  "465"  "466"  "467"  "468"  "501"  "523"
[81] "538"  "549"  "556"  "557"  "558"  "559"  "560"  "561"  "568"  "578"
[91] "590"  "598"  "627"  "635"  "636"  "638"  "671"  "680"  "711"  "727"
[101] "743"  "746"  "748"  "765"  "778"  "836"  "841"  "940"  "941"  "942"
[111] "943"  "944"  "945"  "946"  "954"  "1030" "1031" "1092" "1093"
```

There is not a single vertex from the random sample set but, suspiciously, for each of them there is a vertex with an ID increased by one. What happens is that igraph uses the random sample as indices to the vertices that are themselves labeled from 1 to n (number of nodes in the graph). I.e., if we address the first ten nodes of emailNW by [0:9] we will get vertices labeled 1 to 10:

```> V(emailNW)[0:9]
Vertex sequence:
[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
```

But (you knew there would be a but, didn't you?) all other attributes assigned to the node set of the graph are indexed by 1 to n. For example, the character with the names (labels) of the vertices is indexed 1 to n:
` `
```> V(emailNW)[0]
Vertex sequence:
[1] "1"
> V(emailNW)\$name[0]
character(0)
> V(emailNW)\$name[1:10]
[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
> V(emailNW)[0:9]
Vertex sequence:
[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
```

Beautiful. With this it is now a piece of cake to do the coloring in igraph as well:
` `
```color <- rep(2, times=vcount(emailNW))
color[randomSample] <- 3
V(subgraph)\$color <- color[as.numeric(V(subgraph)\$name)-1]
V(subgraph)\$layout <- layout.fruchterman.reingold(subgraph)
plot(subgraph)
```

Now, the figure will have 10 green nodes as planned.

 Yeah, I know, beautiful layout. That is WHY we need to export it to GEPHI and beautify it there.

Btw, the layout assignment will throw two warnings but please don't ask me why. I hope I will not have to enquire that as well... ;-)

You might want to check that the right nodes got the right color:
` `
```V(subgraph)\$name[V(subgraph)\$color==3]
V(emailNW)[randomSample]
```

 Gephi's version
So, here is the Gephi layout. Of course, a hairball is a hairball - but: did you see the isolated component beforehand?

## Monday, July 25, 2011

### Visualizing and Coloring of Induced Subgraphs

If the network is too large to visualize it as a whole, it might be a good idea to choose some nodes at random and to visualize their direct surroundings, also called their ego network. The ego network comprises the direct neighbors of a node and the relationship between these neighbors. In this example we will combine all edges between 10 randomly chosen nodes and their direct neighbors. The data set we are working on is freely available from Alex Arena's homepage. We use his data on the email contact network of the University Rovira y Virgili and we will assume that it is stored as 'email_Arenas.txt'. (For all readers who are fluent in latex, here is the according Sweave-file which will create all figures in eps/pdf/png format. It assumes you have set the working directory to where itself and the data is located. )

```> library(network)
> setwd("path-to-wherever-you-stored-the-data")

# create a network from the data frame
> emailNW <- network(tableEmail, directed = T)

# choose 10 random vertex IDs without replacement
> randomSample <- sample(1:network.size(emailNW), 10, replace = FALSE)

# show the randomly drawn vertex IDs
> randomSample
[1]  843  548  691  871  524  858  290 1059  485  593

#initialize a new vector
> neighs <- vector()

#for all vertex IDs in the random sample
> for (x in randomSample) {

# add themselves and their direct neighborhood to the vector 'neighs'
>  neighs <- c(neighs, x, get.neighborhood(emailNW, x, type = "combined"))
> }

#create the induced subgraph
> igraph <- get.inducedSubgraph(emailNW, neighs)
```
Now, we would like to see this random sample from the graph:
```> plot(igraph)
```

 An induced subgraph based on 10 randomly drawn seed nodes and their direct neighborhoods.

The induced subgraphs gives an overall impression on how dense the local neighborhoods and the connections between 10 randomly drawn neighborhoods are. It would, however, be helpful to identify the 10 seeds. This can be done by coloring them accordingly. This was my first try:
```#make a vector of size n (=#number of nodes), assigning color 2 as default
> color <- rep(2, times = network.size(emailNW))
#for the seed nodes, assign color number 3
> color[network.vertex.names(emailNW) %in% randomSample] = 3
#plot the graph and assign the color vector
> plot(igraph, vertex.col = color[network.vertex.names(igraph)])
```

Interestingly, this does not give the wanted results; it seems as if there were no seed vertices at all:

 A test plot in which all seed nodes should have been colored in green. Did not work, obviously.

If you would run the same code again and again, you would sometimes see one or two or even more green vertices. So, what happens (or seems to happen) is the following: the induced subgraph creates a new graph with obviously fewer vertices than the original graph. The vertex names themselves are maintained, and their order is maintained as well - you can check that by typing network.vertex.names(igraph). But only the first n entries in vertex.col are used as an assigment to the colors. Thus, we need to reduce the color-vector to those entries which are contained in igraph:

```> plot(igraph ,vertex.col=color[network.vertex.names(igraph)])
```

This will finally produce the graph with all 10 seed nodes colored in green:

 The final result with all 10 seed nodes colored in green.

## Saturday, July 23, 2011

### Importing Data into the network Package in R (II)

Okay, let's proceed with the last example to find another interesting behavior of the network and igraph packages in R. Again, we will assume you downloaded the autonomous system network as compiled by Jure Leskovec, named it "edges.txt" and saved it into your favourite directory "favDir". We will now import it:

```> setwd("path-to-favDir")
> library(network)
```
We now look at the lines 500 to 600 in the table and create a network from these lines:
```> myEdges[500:600,]
> network <- network(myEdges[500:600,])
```
In this part of the table, there are exactly 101 edges, all with the same source node with ID 1 and 101 other nodes. Thus, we would expect that the resulting network has 102 nodes. So, let's check:
```> network.size(network)
> 568
```
So, essentially what happens is that the network()-function creates a node for each ID from 1 until the maximum ID of 568. This is a (bug or feature?) side-effect of the IDs being numeric, i.e., of type int in the data.frame myEdges. If you again tell R to read in the nodes' IDs as characters, the resulting network will have the expected 102 nodes:
```> myEdges <- read.table("edges.txt", header=T, colClasses=c("character", "character"))
> network2 <- network(myEdges[500:600,])
> network.size(network2)
> 102
```

So, if you ever experience long transformation times from data.frame to network, check whether your IDs have a numeric type and whether you have IDs which are much higher than the number of nodes in the network.

### Importing Data into the network Package in R (I)

Importing network data into R seems to be so simple. We will assume that you have an edge list in a data file called
`edges.txt`
`favDir`

It could, e.g., be a data set like the connections of autonomous systems in January 2000, as compiled by Jure Leskovec of which we show the first lines:

```# Undirected graph: as20000102.txt
# Autonomous Systems (from the BGP logs) from January 02 2000
# Nodes: 6474 Edges: 13233
FromNodeId ToNodeId
0 1
0 2
0 3
0 4
0 5
0 6
```

The first three lines are comments, as indicated by the hash symbol. We have removed the hash symbol from the fourth line as this contains the names of the columns. This data can easily be read into R:

```> library(network)
> library(sna)
> setwd("path-to-favDir")

> network <- network(myEdges)
```
But something is strange, if we look at the number of nodes in the network:
```> network.size(network)
> 6476
```
But this does not match with Jure's statistics: he says that the network has only 6474 nodes! So, let's look at the nodes' names in the network:
```> tail(network.vertex.names(network))
> "996"        "997"        "998"        "999"        "FromNodeId" "ToNodeId"
```
With
`tail`
we only see the last six entries of all vertex names. But we can see, what happened: the header information "FromNodeId" and "ToNodeId" was interpreted as an edge between two nodes named "FromNodeId" and "ToNodeId". This can be remedied by adding
`header=T`
```> myEdges2 <- read.table("edges.txt", header=T)
> network2 <- network(myEdges)
```
But now we get a very strange error message which is not particularly helpful:
```Error in if (matrix.type == "edgelist") { :
missing value where TRUE/FALSE needed
```
By reducing the data set a bit, you would find that nothing happens as long as you do not have a node with ID 0. But why was that not a problem beforehand? R makes some implicit assumptions when reading in data files. In the first go, the first real line of the file was interpreted as an edge between two nodes with names "FromNodeId" and "ToNodeId". Since these 'names' consisted of letters, R imported all subsequent IDs as so-called "factors", i.e., observations of different categories, where the names are interpreted as the possible categories.. This can be seen by looking at the structure of myEdges:
```> str(myEdges)
> 'data.frame':   13896 obs. of  2 variables:
> \$ V1: Factor w/ 2257 levels "0","1","10","100",..: 2257 1 1 1 1 1 1 1 1 1 ...
> \$ V2: Factor w/ 6431 levels "1","10","100",..: 6431 1 1109 2217 3318 4419 5524 6103 6214 6322 ...
```
If this is transformed into a network, the "category names" are imported as strings of letters. But the new import explicitly said that the first line contains header information, so the first 'real' line is recognized as numbers by R. Let's look at the structure of myEdges2:
```> str(myEdges2)
> 'data.frame':   13895 obs. of  2 variables:
> \$ FromNodeId: int  0 0 0 0 0 0 0 0 0 0 ...
> \$ ToNodeId  : int  1 2 3 4 5 6 7 8 9 10 ...
```
So, essentially, R was clever to recognize the right type of the IDs (namely int). If we now try to transform this data.frame into a network, the network-package kicks in: it just doesn't like a numerical ID equal to 0. There are now two remedies:
```> myEdges3 <- read.table("edges.txt", header=T)
> myEdges3\$FromNodeId <- myEdges3\$FromNodeId+1
> myEdges3\$ToNodeId <- myEdges3\$ToNodeId+1
> network3 <- network(myEdges3)
> network.size(network3)
```
With the second and third line we increased all IDs by one and re-assigned these values to the two columns. Now, the network transformation proceeds without error and the number of nodes is correct with 6474. The other approach is to tell R explicitly to read in the IDs as "names", i.e., strings of letters:
```> myEdges4 <- read.table("edges.txt", header=T, colClasses=c("character", "character"))
> network4 <- network(myEdges4)
> network.size(network4)
```

Now we enforced the import of the nodes' IDs as characters, and as a name the network package does not care about the 0.

Both approaches have their problems: in the first, you change the IDs and if you have additional node attributes identified by the ID, you need to take care to make them match. Similarly, if the ID is imported as a character you might have more problems to match (numeric) IDs with the character versions of themselves.