Thursday, December 2, 2010

Review of "The relationship between domain duplication and recombination"

The following paper gives a very nice example of a biological network analysis in which many of the nine phases of a network analytic project can be found:

Christine Vogel, Sarah A. Teichmann, and Jose Pereira-Leal
"The Relationship between Domain Duplication and Recombination", J. Mol. Biol. 346(1):355-65, 2005

0) Pose the question: Are domain combinations determined simply by random combination of domains of different abundances, or is there an additional step of selection?
1) Build the network: Domains are classified into superfamilies, and two superfamilies are connected by a directed edge if they are directly adjacent in the N->C direction in at least one gene.
2) Choose subgraph H: ego-network
3) Choose measure: the size of the ego-network (versatility) vs. an external measure, namely the abundance of the node (the number of times the domain appears in the genome); a small sketch of steps 1-3 follows this list.
4) Null-model: random combinations of domains while maintaining the abundance of the node
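To make steps 1-3 concrete, here is a minimal sketch in Python with networkx (assumed available); the gene architectures and superfamily names are invented toy data, purely for illustration:

```python
import networkx as nx
from collections import Counter

# Toy data: each gene is its N->C ordered list of domain superfamilies.
# These architectures and names are invented for illustration only.
genes = [
    ["SF_kinase", "SF_SH2", "SF_SH3"],
    ["SF_kinase", "SF_SH2"],
    ["SF_SH3", "SF_kinase"],
    ["SF_Ig", "SF_kinase", "SF_Ig"],
]

# Step 1: directed edge A -> B if A lies directly N-terminal of B in at least one gene.
G = nx.DiGraph()
abundance = Counter()
for gene in genes:
    abundance.update(gene)
    for a, b in zip(gene, gene[1:]):
        G.add_edge(a, b)

# Steps 2+3: the ego-network of a superfamily is given by its direct neighbors
# (predecessors and successors); its size is the domain's versatility.
for sf in G.nodes():
    versatility = len(set(G.predecessors(sf)) | set(G.successors(sf)))
    print(sf, "abundance:", abundance[sf], "versatility:", versatility)
```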

First result: the relationship between versatility and abundance is very stable, independent of how the data set is divided into categories; tested were divisions by structural class, phylogeny, the protein's function, and the process it contributes to. Based on this result alone, the null-model of random combination cannot be rejected.

5) Designing the model: the null-model maintains the abundance of each domain, the number of domains per gene, and their total number per genome.
6) Comparison observation/model: The versatility of the nodes in this model is much higher than in reality, i.e., real domains have fewer adjacent domains than the random model predicts (see the sketch after this list).
7) Design of a new model that suits the findings better: here, domains are shuffled and then duplicated until one of the component domains is used up.
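And a minimal sketch of the null-model of step 5 together with the comparison of step 6, reusing the toy `genes` list from the sketch above (the randomized-duplication model of step 7 is not implemented here):

```python
import random
import networkx as nx

def shuffled_genome(genes, rng):
    """Null-model: keep each domain's abundance, the number of domains per gene,
    and the total number of domains per genome, but combine them at random."""
    pool = [sf for gene in genes for sf in gene]      # every single domain instance
    rng.shuffle(pool)
    shuffled, i = [], 0
    for gene in genes:                                # preserve the domains-per-gene counts
        shuffled.append(pool[i:i + len(gene)])
        i += len(gene)
    return shuffled

def mean_versatility(genome):
    # Rebuild the adjacency network and average the ego-network sizes.
    g = nx.DiGraph([(a, b) for gene in genome for a, b in zip(gene, gene[1:])])
    return sum(len(set(g.predecessors(v)) | set(g.successors(v))) for v in g) / g.number_of_nodes()

# genes: the toy architectures defined in the sketch above
print("observed mean versatility:  ", mean_versatility(genes))
print("null-model mean versatility:", sum(mean_versatility(shuffled_genome(genes, random.Random(s)))
                                          for s in range(1000)) / 1000)
```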

No processes were tested on any of the models, so steps 8 and 9 do not apply.

Wednesday, October 20, 2010

The nine phases of a network analytic project

In a bold attempt to tame network analytic literature, I have boiled down network analytic projects to the following nine phases:

  1. Pose the question;
  2. Build the network G;
  3. Choose a set of subgraphs H;
  4. Design a measure on H in G;
  5. Choose an appropriate null-model, i.e., choose a suitable random graph model G';
  6. Measure significance of H with respect to G';
  7. Design a model M that evolves all significant structures in H;
  8. Design a process P on top of the network;
  9. Analyze the interplay between M and P.
You don't find this scheme in your favourite network analytic paper? Then let's check it against the classic papers by Watts and Strogatz and by Barabási and Albert.

A first example: Collective Dynamics of Small-World Networks by D. Watts and S. Strogatz

  1. The authors start with the question of whether real-world networks are more of the ordered, local type or of the purely random type.
  2. To check this question, they construct three networks: the known neural connections of the worm Caenorhabditis elegans; the actors that were co-cast in any film in the IMDB at that time; and the U.S. power grid.
  3. They actually define two sets of subgraphs: one contains all shortest paths in G, the other contains for each node the so-called ego-network, i.e., the set of edges to and between its neighbors.
  4. The measures they define are, first, the average length of all shortest paths and, second, the clustering coefficient of each ego-network.
  5. The null-model they choose is the classic random graph model G(n,p). 
  6. Although they do not directly assess the significance of the values in G, they compare the observed values with the ones in the corresponding random graph. The difference between these values is so large that their significance is quite obvious (see the sketch after this list).
  7. The small-world model captures the newly found significant structure, namely the combination of a high clustering coefficient and a low average path length. Moreover, this model has only one parameter p that lets them interpolate between a strongly ordered graph and an essentially random one.
  8. They also define some processes on networks, namely a version of an iterated prisoner's dilemma, synchronization of coupled oscillators, and a simple voting model.
  9. In a last step they analyze the interplay between the outcome of each of the three processes and the chosen parameter p.
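To make steps 4-6 (and the role of the single parameter p in step 7) concrete, here is a minimal sketch, assuming Python with networkx; it uses synthetic Watts-Strogatz graphs instead of the three original data sets, and all sizes are arbitrary illustration values:

```python
import networkx as nx

n, k = 1000, 10                        # number of nodes; neighbors per node in the ring lattice

def summarize(G):
    # Step 4: average shortest path length and average clustering coefficient.
    return nx.average_shortest_path_length(G), nx.average_clustering(G)

# Step 7: the small-world model interpolates between order (p = 0) and randomness (p = 1).
for p in [0.0, 0.01, 0.1, 1.0]:
    L, C = summarize(nx.watts_strogatz_graph(n, k, p, seed=42))
    print(f"WS     p={p:<4}  L={L:6.2f}  C={C:.3f}")

# Steps 5+6: compare with a G(n,p) random graph of the same expected density.
R = nx.gnp_random_graph(n, k / (n - 1), seed=42)
if nx.is_connected(R):                 # average path length is only defined for connected graphs
    L, C = summarize(R)
    print(f"G(n,p)        L={L:6.2f}  C={C:.3f}")
```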

A second example: Emergence of Scaling in Random Networks by A.-L. Barabási and R. Albert

  1. The authors pose the question of what kind of degree distribution we normally see in a real-world network.
  2. They used the same actor and power grid networks as Watts and Strogatz and added a network constructed from a partial crawl of the web at that time. 
  3. The set of subgraphs they defined is given by each vertex and its incident edges.
  4. The measure on this set is simply the number of edges in each subgraph, i.e., the degree of each vertex. This time it is aggregated over all vertices and viewed as a measure of the whole graph: the degree distribution.
  5. Again, the authors use G(n,p) as the suitable random graph model and also the newly introduced small-world model.
  6. They compare the expected degree distributions in both null-models with the ones observed in the real networks and it is obvious that these deviate significantly from each other (although this is not formally assessed).
  7. The preferential attachment model is then introduced to capture these newly found significant structures (see the sketch after this list).
  8. In a second paper (Error and Attack Tolerance of Complex Networks) they define two processes on networks, namely the random removal of nodes (error/random failure) and the deliberate removal of the best connected nodes (attack).
  9. They then analyze the behavior of these two processes with respect to the underlying graph model, namely the fully random graph from the G(n,p) model and the newly designed preferential attachment model.
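Again, a minimal sketch of steps 5-7, assuming Python with networkx: it compares the tail of the degree distribution of a preferential-attachment graph with that of a G(n,p) graph of the same mean degree (sizes and degree thresholds are arbitrary illustration values):

```python
import networkx as nx

n, m = 10000, 3                                        # nodes; edges attached per new node

BA = nx.barabasi_albert_graph(n, m, seed=1)            # preferential attachment (step 7)
ER = nx.gnp_random_graph(n, 2 * m / (n - 1), seed=1)   # G(n,p) with the same mean degree (step 5)

def tail(G, thresholds=(5, 10, 20, 50, 100)):
    # Steps 4+6: fraction of vertices with degree >= t; a heavy tail is the
    # signature of a scale-free degree distribution.
    degrees = [d for _, d in G.degree()]
    return {t: sum(d >= t for d in degrees) / len(degrees) for t in thresholds}

print("preferential attachment:", tail(BA))
print("G(n,p):                 ", tail(ER))
```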
You see that the phases capture these papers quite well. Of course, so far we restrict ourselves to network analytic projects on static networks. I am convinced that most of our network analytic pet methods, like centrality measures, clustering, network motifs, and position/role assignment, can be meaningfully described within this framework. Let me know your opinion!

Saturday, September 25, 2010

ECCS 2010 - Lisbon

I just came back from the ECCS 2010 in Lisbon and want to share some of the great talks with you as long as the videos are not yet there. The talks in the actual conference were not as good as expected, but luckily there were fabulously organized satellite meetings. Especially good was the High Throughput Humanities workshop, excellently organized by Riley Crane, Gourab Ghoshal, Sune Lehmann, and Max Schich. They provided the audience with a great variety of interesting topics, well presented by enthusiastic speakers (which was kind of a contrast to the actual conference).


The talk that I certainly liked the most was the one by Sebastian Ahnert with the title "Mapping Flavour Space". He came up with an experiment to test the hypothesis that "pairs of foods which share chemical flavour compounds also taste well together". To do so, he analyzed thousands of recipes from all over the world and checked whether or not the ingredients tend to share chemical flavour compounds. The article is not yet out, but I'll keep you updated on this.


Also extremely interesting was the talk by Alan Mislove, who has analyzed several large online social networks. One of the cool projects he showed is Twittermood: a dynamic map for which he first evaluated tweets from a given region at a given time according to their mood and then showed the active regions on a density-preserving map. It shows periodic patterns on different time scales, but also that the west coast is consistently happier than the east coast. California, here we come! Of course there are technical difficulties due to the automatic mood detection. The algorithm is of course not able to detect irony or slang. So, when a Californian says 'that's sick', it does not necessarily mean she is in a bad mood ;-)

Unfortunately, I missed the other talks before lunch since I had to give my talk on one-mode projections of bipartite graphs in the workshop Science of Complex Networks. After lunch I enjoyed the talk by Sang Hoon Lee on "Googling social interactions", which had appeared shortly before in PLoS ONE. In this work they simply test, for each pair of persons in a group of n persons, how often Google finds the two on the same web page. They interpret this number (with some caution) as the degree to which the two persons interact. They did a very interesting pre-election analysis of the changing social interactions between the most important members of the two parties. The approach is of course not without problems: first, Google does not give definite counts of the number of pages it found, and it does not make sure that all of these pages are sufficiently different. Second, people might be named together in an article or blog simply because they are doing the same thing (e.g., running for a seat), not because they did something together. Third, people might be named together because they actually hate each other. Lee et al. did a good job of describing these difficulties. Still, such an analysis might reveal new and totally unforeseen connections that can then be analyzed with better relational data. For me, it is an ideal method to find the 'needle in the haystack' with a big fork before analyzing whether the needle is made of steel or of gold. Thus, with the necessary caution, this new approach might be a very interesting first step for a large-scale network analysis.

Sang Hoon is now with Petter Holme, who gave an interesting talk on "Networks of Internet mediated prostitution". Basically, the research was on a forum where men can rate female prostitutes in Brazil. As always, he was witty and self-ironic. I liked that he cited a statement on HIV by a famous scientist (Elizabeth Pisani), who wrote a blog entry with the title "Swedes make sex boring, even in Brazil" ;-) I don't agree with her on this one: it is actually a difficult topic for a scientist, and I think they did a good job in presenting the data. Anyway, she also has a very interesting perspective on complex systems, so make sure you see her TED talk.

I also liked Alexander Mehler's talk about categorization inconsistencies in Wikipedia a lot. Of course, if a categorization were perfect, it would be a tree or maybe a DAG. That is not what happens in Wikipedia; however, it is at least nearly a DAG. It seems that the paper itself is not yet out, but check his homepage to find related papers.

The talk by Leif Isaksen was quite funny. He and his group got a Google award for proposing a good project on what one could do with Google Books. They proposed to scan through the books and identify places, if possible in the order in which they come up in a book. This information can then be used to visualize the action in a single book or series, or to find all books in which a certain place is named. This sounds far easier than it actually is: many cities have quite different names in different languages (and a city might be referred to by any of them, even if the book is in English), the name of a city can change over time (Lisbon was once called Olissipo :-)), the name can be a common word in a text (as stated in the talk, there is a city called 'A' and one called 'B'), and finally, there can be different cities with the same name. So, it will be interesting to see how the project evolves.

Sunday, January 3, 2010

The same closeness, very different degrees!

At the moment, we are analyzing how innovations spread in social groups. For this task, we needed a graph family in which two nodes have the same closeness centrality but different degree centralities. We will use this graph family to understand whether the observed effect correlates rather with the degree or with the closeness of a node. Since it might be of interest for other processes as well, we present it here.

Corollary:
For each parameter k > 1, there is a tree in which one node of degree k+1 and another node of degree 2 have the same closeness centrality; both have a total distance of 2k(k+1) to all other nodes.

Proof:
We start with the construction of the tree: first, 2k+1 nodes are connected into a chain and labeled v1 to v2k+1. Then, k further nodes are added as neighbors of node v1. We now determine the sum of distances distsum(v1) of node v1 to all other nodes in the tree. Each of the k added neighbors is at distance 1 from v1, and chain node vi is at distance i-1. Thus, distsum(v1) = k + (1+2+...+2k) = k + k(2k+1) = 2k^2 + 2k = 2k(k+1). Node vk+1 has two chain nodes at distance i for each 1 <= i <= k, summing up to 2*k(k+1)/2 = k(k+1); its distance to each of the k additional neighbors of v1 is k+1, which adds another summand of k(k+1). Thus, distsum(vk+1) is also 2k(k+1), so v1 and vk+1 have the same closeness centrality, while v1 has degree k+1 and vk+1 has degree 2.
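A quick numerical check of the construction, assuming Python with networkx; node vi is represented by the integer i, and the k extra neighbors of v1 get the labels 2k+2, ..., 3k+1:

```python
import networkx as nx

def build_tree(k):
    """Chain v1, ..., v_{2k+1} plus k extra leaves attached to v1 (the construction above)."""
    T = nx.Graph()
    nx.add_path(T, range(1, 2 * k + 2))                            # the chain v1 .. v_{2k+1}
    T.add_edges_from((1, 2 * k + 1 + j) for j in range(1, k + 1))  # k extra neighbors of v1
    return T

for k in [2, 3, 5, 10]:
    T = build_tree(k)
    d1 = sum(nx.shortest_path_length(T, source=1).values())        # distsum(v1)
    d2 = sum(nx.shortest_path_length(T, source=k + 1).values())    # distsum(v_{k+1})
    print(f"k={k}: distsums {d1}/{d2} (expected {2*k*(k+1)}), degrees {T.degree(1)}/{T.degree(k+1)}")
```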