|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object net.sf.farrago.util.FarragoCardinalityEstimator
public class FarragoCardinalityEstimator
FarragoCardinalityEstimator estimates the number of distinct values in a population based on a sample of that population. The algorithms in this class are adapted from "Estimating the Number of Classes in a Finite Population" by Peter J. Haas and Lynne Stokes (IBM Research Journal; RJ 10025; May 1996).
Nested Class Summary | |
---|---|
private static class |
FarragoCardinalityEstimator.SparseLongArray
|
Field Summary | |
---|---|
private boolean |
assumeUniqueConstraint
Controls whether f , f_small , B , n_small , and dn_small are computed. |
private List<Long> |
B
The number of items in each class where the class appears greater than Duj2a_DIVIDER times. |
private long |
dn
The number of classes (distinct values) in the sample. |
private long |
dn_small
For computation of Duj2a, the number of classes (distinct values) that appear fewer than Duj2a_DIVIDER times. |
private static int |
Duj2a_DIVIDER
|
private FarragoCardinalityEstimator.SparseLongArray |
f
Sparse array of class frequencies. |
private FarragoCardinalityEstimator.SparseLongArray |
f_small
For computation of Duj2a, a sparse array of class frequencies. |
private long |
n
The number of items in the sample. |
private long |
N
The full population's size. |
private long |
n_small
For computation of Duj2a, the number of items in the sample that appear fewer than Duj2a_DIVIDER times. |
private long |
nullClassSize
Number of null values in the sample. |
Constructor Summary | |
---|---|
FarragoCardinalityEstimator(long populationSize,
boolean assumeUniqueConstraint)
Construct a FarragoCardinalityEstimator. |
Method Summary | |
---|---|
void |
addSampleClass(long classSize,
boolean isNullClass)
Add a pre-aggregated sample of the population to the estimator. |
private long |
clamp(long value,
long min,
long max)
|
private double |
compute_nj(double q,
double _1q,
double Nj)
|
private long |
computeDsh3(double q)
|
private static double |
computeDuj1(long n,
long dn,
long f1,
double q)
|
private static long |
computeDuj2(long n,
long dn,
long f1,
double q,
double ysqD)
|
private long |
computeDuj2a(long f1,
double q,
double ysqD)
|
long |
estimate()
Estimates the cardinality of the population using the formulas described above . |
long |
estimateDistinctWithNullClass()
Estimates the cardinality of the population with the assumption that all classes except the null class are distinct. |
private long |
estimateNj(long nj,
double q)
|
private static double |
estimateVarianceFromD(long n,
long N,
FarragoCardinalityEstimator.SparseLongArray f,
long D)
|
long |
getNumSampleClasses()
Returns the number of classes (distinct values) in the sample. |
long |
getSampleSize()
Returns the sample size. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private static final int Duj2a_DIVIDER
private final boolean assumeUniqueConstraint
f
, f_small
, B
, n_small
, and dn_small
are computed. Alters behavior of estimate()
.
private final long N
private long n
private long dn
private FarragoCardinalityEstimator.SparseLongArray f
private long n_small
Duj2a_DIVIDER
times. Falls in the range [0, n].
private long dn_small
Duj2a_DIVIDER
times. Falls in the range [0,
dn].
private FarragoCardinalityEstimator.SparseLongArray f_small
Duj2a_DIVIDER
- 1. May contain no
entries.
private List<Long> B
Duj2a_DIVIDER
times. May be empty or have as many entries as
calls to addSampleClass(long, boolean)
(if all classes appear
more than Duj2a_DIVIDER times.
private long nullClassSize
Constructor Detail |
---|
public FarragoCardinalityEstimator(long populationSize, boolean assumeUniqueConstraint)
If the assumeUniqueConstraint flag is set, some intermediate
values are not computed. The estimate()
method will simply
return either the (given) population size or the result of estimateDistinctWithNullClass()
if a null class was seen. The methods
estimateDistinctWithNullClass()
, getSampleSize()
, and
getNumSampleClasses()
continue to work as normal.
populationSize
- the size of the full populationassumeUniqueConstraint
- if true, some intermediate values are not
calculated (see above)Method Detail |
---|
public void addSampleClass(long classSize, boolean isNullClass)
classSize
- the number of times the class appearsisNullClass
- set true if this class has no value (may only be done
once per sample)
IllegalArgumentException
- if classSize is zero or negative, or if
multiple null classes are specified.public long getSampleSize()
public long getNumSampleClasses()
public long estimateDistinctWithNullClass()
public long estimate()
above
.
private long clamp(long value, long min, long max)
private static double estimateVarianceFromD(long n, long N, FarragoCardinalityEstimator.SparseLongArray f, long D)
private static double computeDuj1(long n, long dn, long f1, double q)
private static long computeDuj2(long n, long dn, long f1, double q, double ysqD)
private long computeDuj2a(long f1, double q, double ysqD)
private long estimateNj(long nj, double q)
private double compute_nj(double q, double _1q, double Nj)
private long computeDsh3(double q)
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |