Michael W. MahoneyI am currently (2005) doing work on the theory of randomized algorithms and the application of that theory to problems in scientific computation having to do with extremely large data sets. Recent work with Ravi Kannan and Petros Drineas is described below and has resulted in the following three papers:
Fast Monte Carlo Algorithms for MatricesWe are interested in developing and analyzing fast Monte Carlo algorithms for performing useful computations on large matrices. Examples of such computations include matrix multiplication, the computation of the Singular Value Decomposition of a matrix, and the computation of compressed approximate decompositions (e.g., the CUR decomposition) of a matrix. In the first paper, we present a Pass-Efficient model in which our algorithms may naturally be formulated and we present several algorithms for the approximation of the product of two matrices. In the second paper we present two algorithms for the computation of low rank approximations to a matrix. Finally, in the third paper we present two algorithms to compute a compressed approximate decomposition to a matrix that has several appealing properties.
Since such computations generally require time which is superlinear in the number of nonzero elements of the matrix, we expect our algorithms to be useful in many applications where data sets are modeled by matrices and are extremely large. For example, in Information Retrieval and Data Mining (two rapidly growing areas of research in computer science and scientific computation that build on techniques and theories from fields such as statistics, linear algebra, database theory, pattern recognition and learning theory) a large collection of $n$ objects, e.g., documents, genomes, images, or web pages, is implicitly presented as a set of points in an $m$-dimensional Euclidean space, where $m$ is the number of features that describe the object; thus, this collection may be represented by an $m\times n$ matrix $A$, the columns of which are the object vectors and the rows of which are the feature vectors.
Recent interest in computing with massive data sets has led to the development of computational models in which the usual notions of time-efficiency and space-efficiency have been modified. In the applications that motivate these data-streaming models, e.g., the observational sciences and the monitoring and operation of large networked systems, the data sets are much too large to fit into main memory and thus are either not stored or are stored in a secondary storage device which may be read sequentially as a data stream but for which random access is very expensive. Typically, algorithms that compute on a data stream examine the data stream, keep a small ``sketch'' of the data, and perform computations on the sketch. Thus, these algorithms are usually randomized and approximate and their performance is evaluated by considering resources such as the time to process an item in the data stream, the number of passes over the data, the additional workspace and additional time required, and the quality of the approximations returned.
The motivation for our particular ``pass-efficient'' approach is that in modern computers the amount of disk storage (external memory) has increased enormously, while RAM and computing speeds have increased, yet at a substantially slower pace. Thus, we have the ability to store large amounts of data, but not in RAM, and we do not have the computational ability to process these data with algorithms that require superlinear time. In order to provide a framework in which to view the algorithms we present, we first introduce and describe the Pass-Efficient model of data-streaming computation. In the Pass-Efficient model the computational resources are the number of passes over the data and the additional RAM space and the additional time required. Thus, our algorithms are quite different from traditional numerical analysis approaches and generally fit within the following framework. Our algorithms will be allowed to read the matrices from external storage a few, e.g., one or two or three, times and keep a small randomly-chosen and rapidly-computable ``sketch'' of the matrices in RAM. Our algorithms will also be permitted additional space and time that is linear or sublinear in the number of data elements in order to perform computations on the ``sketch''. The results of these computations will be returned as approximations to the solution of the original problem.
In all of our algorithms, an important implementational issue will be how to form the random sample. An obvious choice is to use uniform sampling, where each data object is equally likely to be picked. Uniform sampling can be performed blindly, i.e., the sample to be chosen can be decided before seeing the data; even when the number $N$ of data elements is not known in advance an element can be selected uniformly at random in one pass over the data. Uniform sampling fits within our framework and is useful for certain (restricted) classes of problems. To obtain much more generality, we will sample according to a judiciously chosen (and data-dependent) set of nonuniform sampling probabilities. This nonuniform sampling, in which in the first pass through the data we compute sampling probabilities (e.g., we may keep rows or columns of a data matrix with probability proportional to the square of their lengths) and in the second pass we draw the sample, offers substantial gains. For example, it allows us to approximately solve problems in sparse matrices as well as dense matrices.
The idea of sampling rows or columns of matrices in order to approximate various operations is not new. One of the main contributions of our work is to demonstrate that a ``sketch'' consisting of a small judiciously chosen random sample of rows and/or columns of the input matrix or matrices is adequate for provably rapid and efficient approximation of several common matrix operations. We believe that the underlying principle of using nonuniform sampling to create ``sketches'' of the data in a small number of passes (and ``pass-efficient'' approaches more generally) constitute appealing and fruitful directions for algorithmic research in order to address the size and nature of modern data sets.
Update: More recent work (2007) has improved all those worst-case additive-error matrix algorithms to matrix algorithms that come with worst-case relative-error performance guarantees. We accomplished this by relating the approximate low-rank matrix approximation problem to an approximate least squares regression problem, which itself is a problem of independent interest. For example, see: