Sublinear Estimation of Entropy and Information Distances

Publication Date

2009

Journal or Book Title

ACM Transactions on Algorithms

Abstract

In many data mining and machine learning problems, the data items that need to be clustered or classified are not arbitrary points in a high-dimensional space, but are distributions, that is, points on a high-dimensional simplex. For distributions, natural measures are not ℓp distances, but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences. Similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component in algorithms for manipulating distributions. Since the datasets involved are typically massive, these algorithms need to have only sublinear complexity in order to be feasible in practice.
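
For reference, the quantities named above have the following standard definitions (textbook formulas, not quoted from the paper): for distributions p and q over n items,

```latex
H(p) = -\sum_{i=1}^{n} p_i \log p_i, \qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}, \qquad
h^2(p,q) = \frac{1}{2} \sum_{i=1}^{n} \bigl(\sqrt{p_i} - \sqrt{q_i}\bigr)^2.
```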

We present a range of sublinear-time algorithms in oracle models, in which the algorithm accesses the data via an oracle supporting several types of query. In particular, we answer a question posed by Batu et al. on testing whether two distributions are close in an information-theoretic sense given independent samples. We then present optimal algorithms for estimating various information divergences and entropy with a more powerful oracle, called the combined oracle, that was also considered by Batu et al. Finally, we consider sublinear-space algorithms for these quantities in the data-stream model. In the course of doing so, we explore the relationship between the aforementioned oracle models and the data-stream model; this continues work initiated by Feigenbaum et al. An important additional component of the study is the consideration of data streams that are ordered randomly rather than only those that are ordered adversarially.
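
To make the oracle models concrete, the following is a minimal Python sketch contrasting the generative sample oracle (draw independent samples) with the combined oracle of Batu et al. (draw samples and also query exact probabilities). The function names and the toy distribution are illustrative assumptions; the plug-in estimator and the averaged -log p(x) estimator shown here are standard baselines for these models, not the paper's optimized sublinear algorithms.

```python
import math
import random
from collections import Counter

def sample_oracle(dist):
    """Draw one item from `dist`, a dict mapping item -> probability.
    Models the generative sample oracle: independent samples only."""
    r = random.random()
    acc = 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point round-off

def plugin_entropy(dist, num_samples):
    """Naive plug-in entropy estimate from independent samples:
    estimate each probability by its empirical frequency, then
    compute the entropy of the empirical distribution."""
    counts = Counter(sample_oracle(dist) for _ in range(num_samples))
    return -sum((c / num_samples) * math.log2(c / num_samples)
                for c in counts.values())

def combined_oracle_entropy(dist, num_samples):
    """Entropy estimate in the combined-oracle model: draw x ~ p via the
    sample oracle, then query the exact probability p(x). The average of
    -log2 p(x) over samples is an unbiased estimate of H(p)."""
    return sum(-math.log2(dist[sample_oracle(dist)])
               for _ in range(num_samples)) / num_samples

if __name__ == "__main__":
    dist = {i: 1 / 8 for i in range(8)}  # uniform on 8 items: H(p) = 3 bits
    print(plugin_entropy(dist, 100_000))
    print(combined_oracle_entropy(dist, 100_000))
```

The contrast illustrates why the combined oracle is more powerful: the sample-only estimator must see each item enough times to estimate its frequency, while the probability query removes that need entirely.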

DOI

https://doi.org/10.1145/1597036.1597038

Pages

-

Volume

5

Issue

4
