Saturday, June 18, 2011

Arts, Humanities, and Complex Networks at NETSCI 2011 – Fueling my scientific batteries



Due to the generosity of various sponsors, the AHCN workshop did not have a registration fee – but since the location was limited in size, participants were asked to register for a cost-free ticket. On the evening before the workshop I found out that I was 83rd on the waiting list for such a ticket! Nonetheless, I just went there the next morning and was lucky that some of the registered participants had not shown up. But boy, they really missed something!

I had very high expectations of this workshop since a similar one at the ECCS’10 (also organized by Maximilian Schich) had already been splendid – and all my expectations were met and even exceeded. The workshop took place in the very modern building of the Ludwig Museum, directly on the Danube. Everything was splendidly and most generously organized: no fee, but an extraordinarily good caterer with a plentiful supply of coffee and pogácsa (small Hungarian bread), and the most inspiring talks I heard at the whole conference. Do I sound enthusiastic? I definitely am! The broad range of topics and the enthusiasm of the many people who had just started to explore the possibilities of network analysis for their field were simply energizing. I’ll give you an overview of the topics and some of the (subjectively) most interesting points.

The first keynote was given by Marek Claassen, on how to automatically score artists. Of course, there is the obvious way of simply summing up the money paid for their works, but Claassen proposed a more detailed model in which fame and recognition are quantified. The main idea is that an exhibition, especially a solo exhibition at a renowned institution, gives more attention to the artist, and that this attention works like a currency. Of course, such a talk raises a lot of questions and even emotions, but it was an interesting start into a workshop which focuses exactly on this frontier: are we able to explore human culture by quantifying the stochastic traces we leave in our digital, virtual world?

Next, Tom Brughman showed his enthusiasm for using network analysis in archeology: he showed how he links various sites by the types of artifacts found at these sites. This research is still young and it is not yet clear which kinds of questions can be answered with it – which makes it all the more interesting for network analysts.

An absolute highlight was the keynote talk by Jean-Baptiste Michel about Culturomics [1][2]. He and his co-authors wondered: hey, within 10 years we can either read a few books very closely – or read 5 million books very superficially – so let's find out how much we can do with the latter! They used 5 million digitized books from Google Books, which amounts to approximately 4% of all books ever published. In these books they looked at the probability that an irregular verb becomes regularized, the time it takes until a new invention like the car, the telephone, or the washing machine makes it into a book, or which profession gets the most attention and remains most persistently in books. The result of the latter is clear: if you're looking for fame you need to become a politician, but never a mathematician! I was very much astonished at how far their approach took them – an amazing talk. I'm sure we will hear more from him and his co-authors.

Even after this really great talk, the next invited speaker, Natalie Henry Riche, had absolutely no problem switching our focus to a totally different but equally fascinating topic: the visualization of large complex networks. She has developed several software tools which all look very interesting: in her PhD she tried out various ways to display graphs as adjacency matrices, or to integrate adjacency matrices (e.g., of dense groups) with a 'normal' 2D embedding of the graph. Within the adjacency matrix approach she had, for example, the idea of folding parts of the matrix away so that you can compare distant rows or columns with each other without losing track of what belongs where. Quite a cool idea! Her homepage does not directly link to downloads of her software, but she mentioned that on request there might be a way to get it.
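Since I could not get hold of her software, here is a generic little sketch of the basic idea of the matrix view (my own toy code with networkx and matplotlib, not her tools): the same graph drawn as a node-link diagram and as an adjacency matrix whose rows and columns are ordered by community, so that dense groups show up as blocks.

```python
# Generic sketch of the matrix view (my own toy code, not Henry Riche's software):
# the same graph as a node-link drawing and as an adjacency matrix ordered by
# community, where dense groups become visible as dark blocks on the diagonal.
import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms import community

G = nx.karate_club_graph()                       # small example graph
groups = community.greedy_modularity_communities(G)
order = [v for grp in groups for v in grp]       # put nodes of one group next to each other
A = nx.to_numpy_array(G, nodelist=order)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
nx.draw(G, ax=ax1, node_size=40)                 # the 'normal' 2D embedding
ax1.set_title("node-link view")
ax2.imshow(A, cmap="Greys")                      # the adjacency matrix view
ax2.set_title("matrix view, ordered by community")
plt.show()
```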

The next talk that really grabbed my attention was by Robin Wilkins, who presented her and her co-authors' work on how the brain reacts to different types of music. For every person in the experiment, the experimenters knew which song they loved most and which type of music they liked in general. The scientists put together a long recording of various songs plus the favorite song of the respective test person. Looking at the working brain it became clear that the brain reacts totally differently to songs it likes than to those it does not understand (music from a different culture) or those it does not like. In particular, Wilkins et al. looked at how well the different parts of the brain, e.g., those doing the hearing or those concerned with emotions and memory, were synchronized during the different songs.

In summary: all of the talks were full of enthusiasm for bringing together the arts, the humanities, and network analysis. It was just very refreshing to feel this enthusiasm, and next time I'll make sure to register for Max's great workshops as early as possible. So, please make sure you keep up the good work, Max!

Thursday, June 9, 2011

Circuits of profits – a satellite event of NetSci 2011

Maven7 managed to create an open atmosphere, attracting a good mixture of business managers, business consultants, and scientists in the well-equipped building of the Central European University. The density of ties and suits was much higher than in a normal physics or computer science conference – which also guaranteed better coffee and cake selections: the caterer was from the famous Gundel restaurant.

The keynote speech was given by Albert-László Barabási, who summarized the aspects of the evolution of network analysis that might be most influential for the economy: he took us on a journey through the preferential attachment model and its close cousin, the fitness model. He pointed out a clear connection to markets: it is not always the first mover in a market that gets big; if you have a genuinely new function to provide, you can still win the market even if you come later. It is known that the fitness model can lead to a Bose-Einstein condensate if one of the nodes has an extremely high fitness, i.e., even a latecomer can grab a big share of all subsequent customers: in the evolution of such a network, this node attracts a constant percentage of all edges, while in the normal preferential attachment model the node with maximal degree has a shrinking relative share of all edges. Barabási showed that the operating system market might essentially be of this fitness type, since Microsoft has had a market share of 85% for the last two decades. I'm not sure whether this model holds, since this network is fundamentally different from the preferential attachment model: most nodes in the operating system network are normal users, not companies producing and selling operating systems. Thus, a new node is much more likely to be a user, and she can only choose from those producers that are already in the market. It would certainly be interesting to test whether the model applies to this slightly different situation. He also briefly mentioned the link community algorithm by Sune Lehmann, which, in my personal view, is one of the best ideas about clustering in networks in years. The link points to the enormous supplemental material presented along with the article.
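To get a feeling for the difference between the two models, here is a small toy simulation (my own sketch of a Bianconi-Barabási-style fitness model, not material from the talk): new nodes attach with probability proportional to degree times fitness, and a single node with a much higher fitness than everybody else ends up holding a large, roughly constant share of all edges even though it arrives late.

```python
# Toy sketch (my own, not from the talk): pure preferential attachment vs. a
# fitness model in which one late-coming node has a much higher fitness.
# Attachment probability is proportional to degree * fitness.
import random

def grow(n, fitness, seed=0):
    random.seed(seed)
    degree = {0: 1, 1: 1}                        # start from a single edge 0-1
    for new in range(2, n):                      # each new node brings one edge
        nodes = list(degree)
        weights = [degree[v] * fitness(v) for v in nodes]
        target = random.choices(nodes, weights=weights, k=1)[0]
        degree[new] = 1
        degree[target] += 1
    return degree

n = 5000
pa = grow(n, fitness=lambda v: 1.0)                          # plain preferential attachment
bb = grow(n, fitness=lambda v: 50.0 if v == 500 else 1.0)    # node 500 is a very 'fit' latecomer

edges = n - 1                                    # one edge per arriving node
print("PA      : largest degree share = %.3f" % (max(pa.values()) / (2.0 * edges)))
print("fitness : share of node 500    = %.3f" % (bb[500] / (2.0 * edges)))
```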


The next invited speaker was Hamid Benbrahim, who made a very interesting point: he introduced the "sigma-delta disconnect", namely that we either know much about aggregates (sums, represented by the sigma) or about the differences between things (represented by the delta). In essence, he argued that there is a gap in our understanding of the macro and the micro level of our complex financial markets. He also pointed out that due to new technology our markets are now much more synchronized than 10-20 years ago, because any small price difference between markets is now easily identified and can be exploited within milliseconds – virtually at the speed of light. He also showed a case in which two software agents created a small crash in the price of some stock because the first one wanted to sell a huge chunk of it. In order not to depress the price, the agent is of course smart enough to offer it in small portions – however, the second agent is programmed to detect behavior like this, guesses that a large chunk will be sold, and bets against the price. This led to the predicted decrease of the price until the process was finally stopped. After a five-second break, other agents realized that the stock was now undervalued and started buying it, and after around an hour the price was back to normal. This shows how the modern trading system can fall into positive feedback loops that push the system out of its equilibrium position.


Alberto Calero made the interesting remark that the Gartner report on Ten Technology Trends 2011 contained three key technologies related to network analysis: social communication and collaboration, social analytics, and next-generation analytics. He also shared his experience on how to convince fellow CEOs that network analysis can help in business decisions: he did not really disclose the details, but it became clear that network analysis was a prominent part of advertising the new services made available by mobile phones in the 90s, like SMS. He also reported on a new campaign in which customers are rewarded in groups, and he emphasized that attracting groups of people to a mobile phone provider might become much more important than attracting single customers. This was of course also a big topic in the later talks, which reported on different churn models and strategies to prevent churn.


Valdis Krebs finally spoke on how to communicate network analysis results. He reported on his experience with various case studies and emphasized that we need new measures and new ways to report them. One of his customers once showed him a simple chart with a curve: the curve goes up – everything is fine. The curve stalls: watch out! The curve goes down – not good at all. The curve goes up again: you're back in business. The customer asked whether it was possible to turn network analytic results into something as simple as that. A first step in this direction seems to be to replace all occurrences of 'social network analysis' by 'organizational network analysis'. Valdis' answer to the request itself is, e.g., not to show networks, but to show a Venn diagram representation of an embedded and clustered network instead, since that is more digestible. He also emphasized developing a standardized procedure, using only a few standardized programs, and allowing for the transfer of best practices. In general his advice is: make it simple.

This last point was basically also the outcome of many of the later talks: do not make a research project out of every case study, but communicate your results in a simple and short manner. In the round table discussion, the invited speakers agreed (make that 'round chair discussion') that we need to follow a schematized, standardized way of producing network analysis reports on economic networks – be they communication, trust, or inspiration networks. Maven7 is helping with that by promoting their first software tool, aimed at consultants who want to add network analysis to their repertoire. A second main point was that maybe we focus too much on single nodes and their position in a network. Helping a company should not boil down to "try to connect these three people more" but rather to creating an atmosphere in which positive network structures can emerge.

Thursday, December 2, 2010

Review on: "The relationship between domain duplication and recombination"

The following paper gives a very nice example of a biological network analysis in which many of the nine phases of a network analytic project can be found:

Christine Vogel, Sarah A. Teichmann, and Jose Pereira-Leal
"The Relationship between Domain Duplication and Recombination", J. Mol. Biol. 346(1):355-65, 2005

0) Pose the question: Are domain combinations just determined by random combinations of domains with different abundance or is there a further step of selection?
1) Build the network: Domains are classified into superfamilies, and two superfamilies are connected by a directed edge if they are directly adjacent in the N->C direction in at least one gene.
2) Choose subgraph H: ego-network
3) Choose measure: size of ego-network (versatility) vs. external measure, namely the abundance of the node (number of times the domain appears in the genome).
4) Null-model: random combinations of domains while maintaining the abundance of the node

First result: the relationship between versatility and abundance is very stable, independent of how the data set is divided into different categories. Tested were: division by structural class, phylogeny, the protein's function, or the process it contributes to. This stability alone thus gives no reason to reject the null-model of random combination.

5) Designing the model: the null-model maintains the abundance of each domain, the number of domains per gene, and their total number per genome.
6) Comparison observation/model: The versatility of the nodes in this model is much higher than in reality, i.e., real domains combine with fewer distinct neighboring domains than the random model predicts.
7) Design of a new model that suits the findings better: here, domains are shuffled and then duplicated until one of the component domains is used up.

Processes were not tested on any of the models, so the last two phases (designing a process and analyzing the interplay) are missing.
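To make phase 5 concrete, here is a small sketch of how I understand the shuffling null-model (my own toy code, not the authors'): the pool of all domain occurrences is shuffled and dealt back into the original gene lengths, which preserves each domain's abundance, the number of domains per gene, and the total number per genome; versatility is then the size of a domain's ego-network in the directed N->C adjacency graph.

```python
# My own toy sketch of the shuffling null-model (phase 5), not the authors' code.
import random
from collections import defaultdict

def versatility(genes):
    """Size of each domain's ego-network in the directed N->C adjacency graph,
    i.e., the number of distinct domains it is directly adjacent to."""
    neighbors = defaultdict(set)
    for gene in genes:                            # a gene is a list of domains, N->C
        for a, b in zip(gene, gene[1:]):
            neighbors[a].add(b)                   # edge a -> b
            neighbors[b].add(a)                   # b sees a in its ego-network, too
    return {d: len(ns) for d, ns in neighbors.items()}

def shuffled_genome(genes, seed=0):
    """Null model: shuffle all domain occurrences but keep the gene lengths, so
    abundances, domains per gene, and the total number per genome stay the same."""
    random.seed(seed)
    pool = [d for gene in genes for d in gene]
    random.shuffle(pool)
    out, i = [], 0
    for gene in genes:
        out.append(pool[i:i + len(gene)])
        i += len(gene)
    return out

# Hypothetical toy genome: each gene is a list of domain superfamilies.
genes = [["A", "B"], ["A", "B"], ["A", "B", "C"], ["D", "A"], ["C", "D"]]
print("observed  :", versatility(genes))
print("null model:", versatility(shuffled_genome(genes)))
```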

Wednesday, October 20, 2010

The nine phases of a network analytic project

In a bold attempt to tame network analytic literature, I have boiled down network analytic projects to the following nine phases:

  1. Pose the question;
  2. Build the network G;
  3. Choose a set of subgraphs H;
  4. Design a measure on H in G;
  5. Choose an appropriate null-model, i.e., choose a suitable random graph model G';
  6. Measure significance of H with respect to G';
  7. Design a model M that evolves all significant structures in H;
  8. Design a process P on top of the network;
  9. Analyze the interplay between M and P.
You don't find this scheme in your favourite network analytic paper? Let's check it against the classic papers by Watts and Strogatz and by Barabási and Albert.

A first example: The Collective Dynamics of Small-World Networks by D. Watts and S. Strogatz

  1. The authors start with the question of whether real-world networks are rather of the ordered, local type or whether they are of the purely random type. 
  2. To check this question, they construct three networks: the known neural connections of the worm Caenorhabditis elegans; the actors that were co-cast in any film in the IMDB at that time; and the U.S. power grid.
  3. They actually define two sets of subgraphs: one contains all shortest paths in G, the other contains for each node the so-called ego-network, i.e., the set of edges to and between its neighbors.
  4. The measures they define are, first, the average length of all shortest paths and, second, the clustering coefficient of each ego-network.
  5. The null-model they choose is the classic random graph model G(n,p). 
  6. Although they do not directly assess the significance of the values in G, they compare the observed values with those in the corresponding random graph. From the difference between these values it is quite obvious that the deviation is significant.
  7. The small-world model captures the newly found significant structure, namely the combination of high clustering coefficient and low average path length. Moreover, this model has only one parameter p that enables them to scale between a rather random graph and a strongly ordered one.
  8. They also define some processes on networks, namely a version of an iterated prisoner's dilemma, synchronization of coupled oscillators, and a simple voting model.
  9. In a last step they analyze the interplay between the outcome of each of the three processes and the chosen parameter p.
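For readers who want to try this themselves, here is a minimal sketch of phases 2 to 6 for this example, using networkx (my own illustration, not the authors' original data or code): measure the average shortest-path length and the average clustering coefficient of a network and compare them with a G(n,p) random graph of the same density.

```python
# Minimal sketch of phases 2-6 for the Watts/Strogatz example, using networkx
# as a stand-in for the original data: compare average shortest-path length and
# average clustering coefficient with a G(n,p) null model of the same density.
import networkx as nx

n, k, p_rewire = 1000, 10, 0.1
G = nx.watts_strogatz_graph(n, k, p_rewire, seed=1)   # phase 2: the 'observed' network
R = nx.gnp_random_graph(n, nx.density(G), seed=1)     # phase 5: G(n,p) null model

for name, H in [("observed", G), ("G(n,p)  ", R)]:
    if not nx.is_connected(H):                        # path lengths need a connected graph
        H = H.subgraph(max(nx.connected_components(H), key=len))
    print(name,
          "avg path length = %.2f" % nx.average_shortest_path_length(H),  # phases 3/4
          "avg clustering = %.3f" % nx.average_clustering(H))
# Phase 6 (informal, as in the paper): the observed graph combines a short average
# path length with a much higher clustering coefficient than the random graph.
```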

A second example: Emergence of Scaling in Random Networks by A.-L. Barabási and R. Albert

  1. The authors pose the question of what kind of degree distribution we normally see in a real-world network.
  2. They used the same actor and power grid networks as Watts and Strogatz and added a network constructed from a partial crawl of the web at that time. 
  3. The set of subgraphs they defined is given by each vertex and its incident edges.
  4. The measure on this set is simply the number of edges in each subgraph, i.e., the degree of each vertex. This time it is aggregated and viewed as a measure of the whole graph, namely the degree distribution.
  5. Again, the authors use G(n,p) as the suitable random graph model and also the newly introduced small-world model.
  6. They compare the expected degree distributions in both null-models with the ones observed in the real networks and it is obvious that these deviate significantly from each other (although this is not formally assessed).
  7. The preferential attachment model  is then introduced to capture these newly found significant structures.
  8. In a second paper (Error and Attack Tolerance of Complex Networks) they define two processes on networks, namely the random removal of nodes (error/random failure) and the deliberate removal of the best connected nodes (attack).
  9. They then analyze the behavior of these two processes with respect to the underlying graph model, namely the fully random graph from the G(n,p) model and the newly designed preferential attachment model.
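And the analogous sketch of phases 5 and 6 for this example (again with networkx graphs standing in for the original actor/web data): the maximum degree and the heavy tail of the degree distribution clearly separate a preferential attachment graph from a G(n,p) graph of the same density.

```python
# Minimal sketch of phases 5-6 for the Barabási/Albert example, with networkx
# graphs standing in for the original data: compare the degree distribution of a
# preferential attachment graph with a G(n,p) graph of the same density.
import networkx as nx

n, m = 10000, 2
BA = nx.barabasi_albert_graph(n, m, seed=1)            # preferential attachment model
ER = nx.gnp_random_graph(n, nx.density(BA), seed=1)    # G(n,p) with the same density

for name, H in [("preferential attachment", BA), ("G(n,p)", ER)]:
    degrees = [d for _, d in H.degree()]
    hubs = sum(1 for d in degrees if d > 50)           # crude look at the tail
    print(name, ": max degree =", max(degrees), ", nodes with degree > 50 =", hubs)
# The preferential attachment graph has hubs of very high degree, while in G(n,p)
# of the same density (mean degree about 4) such hubs simply do not occur.
```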
You see that the phases capture these papers quite well. Of course, so far we restrict ourselves to network analytic projects on static networks. I am convinced that most of our network analytic pet methods, like centrality measures, clustering, network motifs, and position/role assignment, can be meaningfully described within this framework. Let me know your opinion!

Saturday, September 25, 2010

ECCS 2010 - Lisbon

I just came back from the ECCS 2010 in Lisbon and want to share some of the great talks with you while the videos are not yet online. The talks in the actual conference were not as good as expected, but luckily there were fabulously organized satellite meetings. Especially good was the High Throughput Humanities workshop, excellently organized by Riley Crane, Gourab Ghoshal, Sune Lehmann, and Max Schich. They provided the audience with a great variety of interesting topics, well presented by enthusiastic speakers (which was kind of a contrast to the actual conference).


The talk that I certainly liked the most was the one by Sebastian Ahnert with the title "Mapping Flavour Space". He came up with an experiment to test the hypothesis that "pairs of foods which share chemical flavour compounds also taste well together". To do so, he analyzed thousands of recipes from all over the world and checked whether their ingredients tend to share chemical flavour compounds or not. The article is not yet out, but I'll keep you updated on this.


Also extremely interesting was the talk by Alan Mislove, who has analyzed several large online social networks. One of the cool projects he showed is Twittermood: it is a dynamic map for which he first evaluated tweets from a given region at a given time with respect to their mood and then showed the active regions on a density-preserving map. It shows periodic patterns on different time scales, but also that the west coast is consistently happier than the east coast. California, here we come! Of course there are technical difficulties due to the automatic mood detection. The algorithm is, of course, not able to detect irony or slang. So, when a Californian says 'that's sick' it does not necessarily mean she is in a bad mood ;-)

Unfortunately, I missed the other talks before lunch since I had to give my own talk on one-mode projections of bipartite graphs in the workshop Science of Complex Networks. After lunch I enjoyed the talk by Sang Hoon Lee on "Googling social interactions", which had appeared shortly beforehand in PLoS ONE. In this work they simply test, for a group of n persons, how often Google finds any two of them on the same web page. They interpret this number (with some caution) as the degree to which the two persons are interacting. They did a very interesting pre-election analysis of the changing social interactions between the most important members of the two parties. The approach is of course not without problems: Google does not give definite counts of the number of pages it found, and it does not make sure that all of the pages are sufficiently different. Second, people might be named together in a given article or blog simply because they are doing the same thing (e.g., being candidates for a seat), not because they did something together. Third, people might be named together because they actually hate each other. Lee et al. did a good job of describing these difficulties. Still, such an analysis might reveal new and totally unforeseen connections that can then be analyzed with better relational data. For me, it is an ideal method to find the 'needle in the haystack' with a big fork before analyzing whether the needle is made of steel or of gold. Thus, with the necessary caution, this new approach might be a very interesting first step for a large-scale network analysis.

Sang Hoon is now with Petter Holme, who gave an interesting talk on "Networks of Internet mediated prostitution". Basically, the research was on a forum where men can rate female prostitutes in Brazil. As always, he was witty and self-ironic. I liked that he cited a famous scientist working on HIV (Elizabeth Pisani), who wrote a blog entry about the work with the title "Swedes make sex boring, even in Brazil" ;-) I don't agree with her on this one: it is actually a difficult topic for a scientist, and I think they did a good job in presenting the data. Anyway, she also has a very interesting perspective on complex systems, so make sure you see her TED talk.

I also liked Alexander Mehler's talk about categorization inconsistencies in Wikipedia a lot. Of course, if a categorization were perfect, it should be a tree or at least a DAG. That is not what happens in Wikipedia; however, it is very nearly a DAG. It seems that the paper itself is not yet out, but check his homepage to find related papers.

The talk by Leif Isaksen was quite funny. He and his group got a Google award for proposing a good project on what one could do with Google Books. They proposed to scan through the books and identify places, if possible in the order in which they come up in each book. This information can then be used to visualize the action in a single book or series, or to find all books in which a certain place is named. This sounds far easier than it actually is: many cities have quite different names in different languages (and they might be referred to by any of them, even if the book is English), the name of a city can change over time (Lisbon was called Olissipo in earlier times :-)), the name can be a common word in a text (as stated in the talk, there is a city called 'A' and one called 'B'), and finally, there can be different cities with the same name. So, it will be interesting to see how the project evolves.

Sunday, January 3, 2010

The same closeness, very different degrees!

At the moment, we are analyzing how innovations spread in social groups. For this task, we needed a graph family in which two nodes have the same closeness centrality but different degree centralities. We will use this graph family to understand whether the observed effect correlates rather with the degree or with the closeness of a node. Since it might be of interest for other processes as well, we present it here.

Corollary:
For each parameter k>0, there is a tree in which one node of degree k+1 and another of degree 2 have the same total distance to all other nodes, namely 2k(k+1), and therefore the same closeness centrality.

Proof:
We start with the construction of the tree: first, 2k+1 nodes are connected into a chain and labeled v1 to v_{2k+1}. Then, k additional nodes are attached as neighbors to node v1, so that v1 has degree k+1 while the inner chain node v_{k+1} has degree 2.

We first determine the sum of distances distsum(v1) of node v1 to all other nodes in the tree. Each of the k added neighbors is at distance 1 from v1, and each chain node v_i (for i = 2, ..., 2k+1) is at distance i-1. Thus distsum(v1) = k + (1 + 2 + ... + 2k) = k + 2k(2k+1)/2 = k + 2k^2 + k = 2k(k+1).

Now consider v_{k+1}: for each i with 1 <= i <= k it has exactly two nodes at distance i (one on each side of the chain), which sums to 2 * k(k+1)/2 = k(k+1). The k additional neighbors of v1 are at distance k+1, contributing another k(k+1). Thus distsum(v_{k+1}) = 2k(k+1) as well, which proves the claim.
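For the skeptical reader, here is a short numerical check of the construction (my own verification sketch with networkx, not part of the original proof): build the tree for a given k and compare the distance sums of v1 and v_{k+1}.

```python
# Numerical check of the construction (my own sketch with networkx): node 0 plays
# the role of v1 and node k the role of v_{k+1}; both should have distance sum 2k(k+1).
import networkx as nx

def build_tree(k):
    T = nx.path_graph(2 * k + 1)                 # the chain v1,...,v_{2k+1} (nodes 0..2k)
    for j in range(k):                           # attach k extra leaves to v1 (node 0)
        T.add_edge(0, 2 * k + 1 + j)
    return T

def distance_sum(T, v):
    return sum(nx.shortest_path_length(T, source=v).values())

for k in range(1, 6):
    T = build_tree(k)
    v1, vk1 = 0, k                               # v1 -> node 0, v_{k+1} -> node k
    assert T.degree(v1) == k + 1 and T.degree(vk1) == 2
    assert distance_sum(T, v1) == distance_sum(T, vk1) == 2 * k * (k + 1)
    print("k =", k, ": both distance sums equal", 2 * k * (k + 1))
```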