net.sf.farrago.util
Class FarragoCardinalityEstimator

java.lang.Object
  extended by net.sf.farrago.util.FarragoCardinalityEstimator

public class FarragoCardinalityEstimator
extends Object

FarragoCardinalityEstimator estimates the number of distinct values in a population based on a sample of that population. The algorithms in this class are adapted from "Estimating the Number of Classes in a Finite Population" by Peter J. Haas and Lynne Stokes (IBM Research Journal; RJ 10025; May 1996).

Author:
Stephan Zuercher

Nested Class Summary
private static class FarragoCardinalityEstimator.SparseLongArray
           
 
Field Summary
private  boolean assumeUniqueConstraint
          Controls whether f, f_small, B, n_small, and dn_small are computed.
private  List<Long> B
          The number of items in each class where the class appears greater than Duj2a_DIVIDER times.
private  long dn
          The number of classes (distinct values) in the sample.
private  long dn_small
          For computation of Duj2a, the number of classes (distinct values) that appear fewer than Duj2a_DIVIDER times.
private static int Duj2a_DIVIDER
           
private  FarragoCardinalityEstimator.SparseLongArray f
          Sparse array of class frequencies.
private  FarragoCardinalityEstimator.SparseLongArray f_small
          For computation of Duj2a, a sparse array of class frequencies.
private  long n
          The number of items in the sample.
private  long N
          The full population's size.
private  long n_small
          For computation of Duj2a, the number of items in the sample that appear fewer than Duj2a_DIVIDER times.
private  long nullClassSize
          Number of null values in the sample.
 
Constructor Summary
FarragoCardinalityEstimator(long populationSize, boolean assumeUniqueConstraint)
          Construct a FarragoCardinalityEstimator.
 
Method Summary
 void addSampleClass(long classSize, boolean isNullClass)
          Add a pre-aggregated sample of the population to the estimator.
private  long clamp(long value, long min, long max)
           
private  double compute_nj(double q, double _1q, double Nj)
           
private  long computeDsh3(double q)
           
private static double computeDuj1(long n, long dn, long f1, double q)
           
private static long computeDuj2(long n, long dn, long f1, double q, double ysqD)
           
private  long computeDuj2a(long f1, double q, double ysqD)
           
 long estimate()
          Estimates the cardinality of the population using the formulas described above.
 long estimateDistinctWithNullClass()
          Estimates the cardinality of the population with the assumption that all classes except the null class are distinct.
private  long estimateNj(long nj, double q)
           
private static double estimateVarianceFromD(long n, long N, FarragoCardinalityEstimator.SparseLongArray f, long D)
           
 long getNumSampleClasses()
          Returns the number of classes (distinct values) in the sample.
 long getSampleSize()
          Returns the sample size.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

Duj2a_DIVIDER

private static final int Duj2a_DIVIDER
See Also:
Constant Field Values

assumeUniqueConstraint

private final boolean assumeUniqueConstraint
Controls whether f, f_small, B, n_small, and dn_small are computed. Alters behavior of estimate().


N

private final long N
The full population's size.


n

private long n
The number of items in the sample.


dn

private long dn
The number of classes (distinct values) in the sample.


f

private FarragoCardinalityEstimator.SparseLongArray f
Sparse array of class frequencies. f[i] is the number of classes (distinct values) that appear exactly k times in the sample.


n_small

private long n_small
For computation of Duj2a, the number of items in the sample that appear fewer than Duj2a_DIVIDER times. Falls in the range [0, n].


dn_small

private long dn_small
For computation of Duj2a, the number of classes (distinct values) that appear fewer than Duj2a_DIVIDER times. Falls in the range [0, dn].


f_small

private FarragoCardinalityEstimator.SparseLongArray f_small
For computation of Duj2a, a sparse array of class frequencies. The maximum index in this array is Duj2a_DIVIDER - 1. May contain no entries.


B

private List<Long> B
The number of items in each class where the class appears greater than Duj2a_DIVIDER times. May be empty or have as many entries as calls to addSampleClass(long, boolean) (if all classes appear more than Duj2a_DIVIDER times.


nullClassSize

private long nullClassSize
Number of null values in the sample.

Constructor Detail

FarragoCardinalityEstimator

public FarragoCardinalityEstimator(long populationSize,
                                   boolean assumeUniqueConstraint)
Construct a FarragoCardinalityEstimator.

If the assumeUniqueConstraint flag is set, some intermediate values are not computed. The estimate() method will simply return either the (given) population size or the result of estimateDistinctWithNullClass() if a null class was seen. The methods estimateDistinctWithNullClass(), getSampleSize(), and getNumSampleClasses() continue to work as normal.

Parameters:
populationSize - the size of the full population
assumeUniqueConstraint - if true, some intermediate values are not calculated (see above)
Method Detail

addSampleClass

public void addSampleClass(long classSize,
                           boolean isNullClass)
Add a pre-aggregated sample of the population to the estimator. Pre-aggregated sample means that this method is called exactly once for each class (distinct value) in the sample, with the number of times the class apears in sample given as classSize. Classes need not be added in any particular order. The goal of this method is to compute the data structures necessary to estimate cardinality of the sample without requiring multiple passes through the sample itself.

Parameters:
classSize - the number of times the class appears
isNullClass - set true if this class has no value (may only be done once per sample)
Throws:
IllegalArgumentException - if classSize is zero or negative, or if multiple null classes are specified.

getSampleSize

public long getSampleSize()
Returns the sample size.


getNumSampleClasses

public long getNumSampleClasses()
Returns the number of classes (distinct values) in the sample.


estimateDistinctWithNullClass

public long estimateDistinctWithNullClass()
Estimates the cardinality of the population with the assumption that all classes except the null class are distinct.


estimate

public long estimate()
Estimates the cardinality of the population using the formulas described above.


clamp

private long clamp(long value,
                   long min,
                   long max)

estimateVarianceFromD

private static double estimateVarianceFromD(long n,
                                            long N,
                                            FarragoCardinalityEstimator.SparseLongArray f,
                                            long D)

computeDuj1

private static double computeDuj1(long n,
                                  long dn,
                                  long f1,
                                  double q)

computeDuj2

private static long computeDuj2(long n,
                                long dn,
                                long f1,
                                double q,
                                double ysqD)

computeDuj2a

private long computeDuj2a(long f1,
                          double q,
                          double ysqD)

estimateNj

private long estimateNj(long nj,
                        double q)

compute_nj

private double compute_nj(double q,
                          double _1q,
                          double Nj)

computeDsh3

private long computeDsh3(double q)