Thumbnail Image

Massive Graph Analysis in the Data Stream Model

Graphs have become an abstraction of choice in modeling highly-structured data. The need to compute graph-theoretic properties of datasets arises in many applications that involve entities and pairwise relations between them. However, in practice the datasets in question can be too large to be stored in main memory, distributed across many machines, or changing over time. Moreover, in an increasing number of applications the algorithm has to make real time decisions as the data arrives, which puts further limitations on the time and space that can realistically be used. These characteristics render classical algorithmic approaches obsolete and necessitate the development of new techniques. The streaming model of computation takes these challenges into account, providing a trade-off between the resources used by the algorithm and its accuracy. A graph stream is defined by a sequence of edge insertions (and sometimes deletions) into an initially empty graph. The objective is to compute a certain property of the graph at the end of the stream while minimizing the amount of space the algorithm uses. In this model, we explore fundamental graph-theoretic problems that also serve as important primitives in massive graph analysis. Our results can be divided into three main categories: Finding large matchings and related problems. We describe two optimal algorithms for finding large matchings in dynamic (insert-delete) graph streams---an approximation of an arbitrary maximum matching and an exact algorithm under the assumption that the matching is of certain size. We also show how the techniques developed in these algorithms can be used to solve a variety of related problems such as vertex cover and hitting set in hypergraphs. We then concentrate on estimating just the size of the matching and present a series of sublinear results for the class of low arboricity graphs. Counting the number of cycles. We fully resolve in which settings there exist algorithms approximating the number of fixed length cycles that do not store the entire graph. For cycles of length five or greater, we show that no such algorithms exist. For triangles and four-cycles, we describe several counting results and a few lower bounds for the insert-only model, considering such parameters as the number of passes taken over the stream and its ordering. Vertex ordering problems in directed graphs. We consider such fundamental problems as topologically sorting a directed acyclic graph (DAG), checking whether the input is in fact a DAG, and finding a minimum feedback arc set. It can be shown that when the input graph is arbitrary, these problems have high space complexity in the streaming model. Thus, we concentrate on designing algorithms for tournaments and a certain family of random graphs. Together, these results complement the much more mature body of work on algorithms for undirected graph streams.
Research Projects
Organizational Units
Journal Issue
Publisher Version
Embedded videos