Saturday, July 23, 2011

Importing Data into the network Package in R (I)

Importing network data into R seems to be so simple. We will assume that you have an edge list in a data file called edges.txt in your favourite directory favDir.

It could, e.g., be a data set like the connections between autonomous systems in January 2000, as compiled by Jure Leskovec, of which we show the first lines:

# Undirected graph: as20000102.txt
# Autonomous Systems (from the BGP logs) from January 02 2000
# Nodes: 6474 Edges: 13233
 FromNodeId ToNodeId
0 1
0 2
0 3
0 4
0 5
0 6

The first three lines are comments, as indicated by the hash symbol. We have removed the hash symbol from the fourth line as this contains the names of the columns. This data can easily be read into R:

> library(network)
> library(sna)
> setwd("path-to-favDir")

> myEdges <- read.table("edges.txt")
> network <- network(myEdges)
But something is strange if we look at the number of nodes in the network:
> network.size(network)
[1] 6476
This does not match Jure's statistics: he says that the network has only 6474 nodes! So, let's look at the nodes' names in the network:
> tail(network.vertex.names(network))
[1] "996"        "997"        "998"        "999"        "FromNodeId" "ToNodeId"
With tail we only see the last six entries of all vertex names, but that is enough to see what happened: the header information "FromNodeId" and "ToNodeId" was interpreted as an edge between two nodes named "FromNodeId" and "ToNodeId". This can be remedied by adding header=T to the read.table call:
> myEdges2 <- read.table("edges.txt", header=T)
> network2 <- network(myEdges2)
But now we get a very strange error message which is not particularly helpful:
Error in if (matrix.type == "edgelist") { : 
  missing value where TRUE/FALSE needed
By reducing the data set a bit, you would find that no error occurs as long as there is no node with ID 0. But why was that not a problem before? R makes some implicit assumptions when reading in data files. In the first attempt, the first real line of the file was interpreted as an edge between two nodes named "FromNodeId" and "ToNodeId". Since these 'names' consist of letters, R imported all subsequent IDs as so-called "factors", i.e., observations of different categories, where the names are interpreted as the possible categories. (This was the default behaviour of read.table before R 4.0.0; newer versions import such columns as plain character strings unless stringsAsFactors=TRUE is set.) This can be seen by looking at the structure of myEdges:
> str(myEdges)
'data.frame':   13896 obs. of  2 variables:
 $ V1: Factor w/ 2257 levels "0","1","10","100",..: 2257 1 1 1 1 1 1 1 1 1 ...
 $ V2: Factor w/ 6431 levels "1","10","100",..: 6431 1 1109 2217 3318 4419 5524 6103 6214 6322 ...
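The effect can be reproduced on a handful of lines without downloading the full data set. The following is only a minimal sketch: the temporary file and the variable names sample_lines, tmp and no_header are made up for illustration, and stringsAsFactors=TRUE is set explicitly so that the result matches the older default described above.

```r
# Write the first lines of the example file to a temporary file
# (hypothetical path, used only for this illustration).
sample_lines <- c(
  "# Undirected graph: as20000102.txt",
  "# Autonomous Systems (from the BGP logs) from January 02 2000",
  "# Nodes: 6474 Edges: 13233",
  "FromNodeId ToNodeId",
  "0 1",
  "0 2",
  "0 3"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Lines starting with '#' are skipped because comment.char defaults to "#".
# No header argument is given, so the column names become a data row.
# stringsAsFactors = TRUE mimics the read.table default before R 4.0.0.
no_header <- read.table(tmp, stringsAsFactors = TRUE)

# Both columns are factors, and the header text is one of the levels.
str(no_header)
print(levels(no_header$V1))
print(nrow(no_header))
```

The row count is one higher than the number of edges in the file, which is the same off-by-one visible in the 13896 observations above.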
If this is transformed into a network, the "category names" are imported as character strings. In the second import, however, we explicitly stated that the first line contains header information, so the first 'real' line is recognized by R as numbers. Let's look at the structure of myEdges2:
> str(myEdges2)
'data.frame':   13895 obs. of  2 variables:
 $ FromNodeId: int  0 0 0 0 0 0 0 0 0 0 ...
 $ ToNodeId  : int  1 2 3 4 5 6 7 8 9 10 ...
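Before handing such a data frame to the network package, it may be worth checking the column types and whether an ID of 0 occurs at all. This is a rough sketch on a few sample lines; the temporary file and the names with_header, col_types and has_zero_id are assumptions made for this example only.

```r
# A few lines of the example file, written to a hypothetical temporary file.
sample_lines <- c(
  "# Nodes: 6474 Edges: 13233",
  "FromNodeId ToNodeId",
  "0 1",
  "0 2",
  "0 3"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# With header = TRUE the first non-comment line supplies the column names,
# and the remaining values are parsed as integers.
with_header <- read.table(tmp, header = TRUE)
col_types <- sapply(with_header, class)
print(col_types)

# Check whether the smallest node ID is 0, the case discussed in the text.
min_id <- min(with_header$FromNodeId, with_header$ToNodeId)
has_zero_id <- min_id == 0
print(has_zero_id)
```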
So, essentially, R was clever enough to recognize the right type of the IDs (namely int). If we now try to transform this data.frame into a network, the network package stumbles: it just doesn't like a numerical ID equal to 0. There are two remedies. The first is to shift all IDs by one:
> myEdges3 <- read.table("edges.txt", header=T)
> myEdges3$FromNodeId <- myEdges3$FromNodeId+1
> myEdges3$ToNodeId <- myEdges3$ToNodeId+1
> network3 <- network(myEdges3)
> network.size(network3)
With the second and third lines we increased all IDs by one and re-assigned these values to the two columns. Now the network transformation proceeds without error and the number of nodes is correct, namely 6474. The other approach is to tell R explicitly to read in the IDs as "names", i.e., character strings:
> myEdges4 <- read.table("edges.txt", header=T, colClasses=c("character", "character"))
> network4 <- network(myEdges4)
> network.size(network4)

Here we forced the node IDs to be imported as characters, and the network package does not mind a 0 when it is a name rather than a number.
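The effect of colClasses can be checked on a small sample without the network package. This is only a sketch: the temporary file and the names as_char and node_names are invented for the illustration.

```r
# A few lines of the example file, written to a hypothetical temporary file.
sample_lines <- c(
  "# Nodes: 6474 Edges: 13233",
  "FromNodeId ToNodeId",
  "0 1",
  "0 2",
  "0 3"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Force both columns to be read as character strings instead of integers.
as_char <- read.table(tmp, header = TRUE,
                      colClasses = c("character", "character"))
str(as_char)

# The distinct node names; "0" is now just another label.
node_names <- unique(c(as_char$FromNodeId, as_char$ToNodeId))
print(node_names)
```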

Both approaches have their problems. In the first, you change the IDs, so if you have additional node attributes identified by ID, you need to take care that they still match. Similarly, if the IDs are imported as characters, you may find it harder to match numeric IDs with their character versions.
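One way to keep attributes aligned under either remedy is sketched below with base R's match. The edge list, the attribute table node_attrs and its columns id and label are hypothetical and not part of the original data set.

```r
# Hypothetical edge list and node attribute table keyed by the original IDs.
edges <- data.frame(FromNodeId = c(0L, 0L, 1L),
                    ToNodeId   = c(1L, 2L, 2L))
node_attrs <- data.frame(id    = c(2L, 0L, 1L),
                         label = c("AS-two", "AS-zero", "AS-one"),
                         stringsAsFactors = FALSE)

# Remedy 1: shift the IDs by one in the edge list AND in the attribute table.
edges_shifted <- edges + 1L
attrs_shifted <- node_attrs
attrs_shifted$id <- attrs_shifted$id + 1L
shifted_ids <- sort(unique(c(edges_shifted$FromNodeId,
                             edges_shifted$ToNodeId)))
labels_shifted <- attrs_shifted$label[match(shifted_ids, attrs_shifted$id)]

# Remedy 2: keep the IDs unchanged, but compare them as character strings.
edges_char <- data.frame(lapply(edges, as.character),
                         stringsAsFactors = FALSE)
char_ids <- unique(c(edges_char$FromNodeId, edges_char$ToNodeId))
labels_char <- node_attrs$label[match(char_ids, as.character(node_attrs$id))]

print(labels_shifted)
print(labels_char)
```

In both variants the lookup is done explicitly by ID rather than by row position, so the order of the attribute table does not matter.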
