|
Michael W. Mahoney
I am currently (2008) doing work on statistical and algorithmic aspects of
large informatics graphs.
Recent work with
Jure Leskovec,
Kevin Lang,
and Anirban Dasgupta
on the community structure in large social and information networks is
described below and has resulted in the following papers:
-
Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,
-
J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
-
Technical Report, Preprint: arXiv:0810.1355 (2008)
(arXiv).
-
Statistical properties of community structure in large social and information networks,
-
J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
-
Proc. 17-th International WWW, 695-704 (2008)
(ps,
pdf).
We are working at standardizing and making publicly-avaliable as many of the
networks we have analyzed as possible.
It will take a bit of time, but click
here
for some initial data.
Community Structure in Large Social and Information Networks
A large body of work has been devoted to defining and identifying communities
in social and information networks, i.e., in graphs in which the nodes
represent underlying social entities and the edges represent some sort of
interaction between pairs of nodes.
Most such research begins with the premise that a community should be thought
of as a set of nodes that has more and/or better-connected edges between its
members than between members of that set and the remainder of the
network---e.g., a set of nodes that has good properties with respect to
the combinatorial quantity known as conductance.
We explore from a novel perspective several questions related to
identifying meaningful communities in large social and information networks,
and we come to several striking conclusions that have implications for
community detection in such networks.
Rather than define a procedure to extract a set of nodes from a graph and then
attempt to interpret that set as a meaningful community, we will employ
approximation algorithms for the graph partitioning problem in an attempt to
characterize as a function of size the statistical and structural properties of
partitions of graphs that could plausibly be interpreted as meaningful
communities.
In particular, we define the network community profile plot, which
attempts to characterize the "best" possible community---according to the
conductance measure---over a wide range of size scales.
We study over 70 large sparse real-world networks taken from a wide range
of application domains (ranging from traditional and on-line social networks,
to technological and information networks and web graphs, and ranging in size
from thousands of nodes up to tens of millions of nodes), and for each of
these networks we compute a wide range of statistics, including "regularized"
and "non-regularized" versions of the network community profile plot.
Taken together, our empirical results suggest a significantly more refined
picture of the community structure in large real-world networks than has been
appreciated previously.
For example, we find that the "best" communities tend to be quite
small---e.g., no more than about 100 nodes---and barely connected to
the rest of the network.
We also observe that beyond this size scale the best possible communities
gradually blend into the expander-like core, and thus that a roughly inverse
relationship exists between community size and optimal community quality.
Local fluctuations are common, and even at the largest size scales we find
local graph structure such that these networks are significantly less
expander-like than corresponding random graphs.
In nearly every network dataset we examined, however, we observe tight but
almost trivial communities at very small scales, and at larger and larger size
scales, the best possible communities gradually blend in more and more with the
rest of the network and thus become less and less community-like.
This behavior is not explained, even at a qualitative level, by any of the
commonly-used network generation models.
Moreover, this behavior is exactly the opposite of what one would expect based
on experience with and intuition from expander graphs, from graphs that are
well-embeddable in a low-dimensional structure, and from small social networks
that have served as testbeds of community detection algorithms.
Certain aspects of it, e.g., the existence of deep cuts or well-defined
"communities" at small size scales and the non-existence of them at very
large scales, are a consequence of the extreme sparsity of the networks, as we
demonstrate by analyzing sparse random graph models.
Other aspects of it, e.g., the relatively gradual increase of the
network community profile plot as a function of increasing size scale, depend
in a subtle manner on the way in which local clustering information is
propagated from smaller to larger size scales in the network.
We have found that a generative graph model, in which new edges are added via
an iterative "forest fire" burning process, is able to produce graphs
exhibiting a network community profile plot similar to what we observe in our
network datasets.
|