Michael W. Mahoney

I am currently (2008) doing work on statistical and algorithmic aspects of large informatics graphs. Recent work with Jure Leskovec, Kevin Lang, and Anirban Dasgupta on the community structure in large social and information networks is described below and has resulted in the following papers:
  • Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,
  • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
    Technical Report, Preprint: arXiv:0810.1355 (2008) (arXiv).
  • Statistical properties of community structure in large social and information networks,
  • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
    Proc. 17-th International WWW, 695-704 (2008) (ps, pdf).
We are working at standardizing and making publicly-avaliable as many of the networks we have analyzed as possible. It will take a bit of time, but click here for some initial data.


Community Structure in Large Social and Information Networks

A large body of work has been devoted to defining and identifying communities in social and information networks, i.e., in graphs in which the nodes represent underlying social entities and the edges represent some sort of interaction between pairs of nodes. Most such research begins with the premise that a community should be thought of as a set of nodes that has more and/or better-connected edges between its members than between members of that set and the remainder of the network---e.g., a set of nodes that has good properties with respect to the combinatorial quantity known as conductance. We explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions that have implications for community detection in such networks.

Rather than define a procedure to extract a set of nodes from a graph and then attempt to interpret that set as a meaningful community, we will employ approximation algorithms for the graph partitioning problem in an attempt to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as meaningful communities. In particular, we define the network community profile plot, which attempts to characterize the "best" possible community---according to the conductance measure---over a wide range of size scales. We study over 70 large sparse real-world networks taken from a wide range of application domains (ranging from traditional and on-line social networks, to technological and information networks and web graphs, and ranging in size from thousands of nodes up to tens of millions of nodes), and for each of these networks we compute a wide range of statistics, including "regularized" and "non-regularized" versions of the network community profile plot.

Taken together, our empirical results suggest a significantly more refined picture of the community structure in large real-world networks than has been appreciated previously. For example, we find that the "best" communities tend to be quite small---e.g., no more than about 100 nodes---and barely connected to the rest of the network. We also observe that beyond this size scale the best possible communities gradually blend into the expander-like core, and thus that a roughly inverse relationship exists between community size and optimal community quality. Local fluctuations are common, and even at the largest size scales we find local graph structure such that these networks are significantly less expander-like than corresponding random graphs. In nearly every network dataset we examined, however, we observe tight but almost trivial communities at very small scales, and at larger and larger size scales, the best possible communities gradually blend in more and more with the rest of the network and thus become less and less community-like.

This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, this behavior is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are well-embeddable in a low-dimensional structure, and from small social networks that have served as testbeds of community detection algorithms. Certain aspects of it, e.g., the existence of deep cuts or well-defined "communities" at small size scales and the non-existence of them at very large scales, are a consequence of the extreme sparsity of the networks, as we demonstrate by analyzing sparse random graph models. Other aspects of it, e.g., the relatively gradual increase of the network community profile plot as a function of increasing size scale, depend in a subtle manner on the way in which local clustering information is propagated from smaller to larger size scales in the network. We have found that a generative graph model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community profile plot similar to what we observe in our network datasets.