Fardream
Fardream

Reputation: 662

BigQuery COUNT DISTINCT estimation error

I understand that BigQuery is providing an estimation of COUNT DISTINCT, but is there any information on how big the error is and what kind of parameters it depends on?

Thanks

Upvotes: 3

Views: 325

Answers (1)

Mosha Pasumansky
Mosha Pasumansky

Reputation: 14014

The accuracy of COUNT DISTINCT estimation depends on real number of distict values. If it is small - the algorithm is pretty accurate (for small values it usually returns the exact value), but the bigger number of distinct values is - the less accurate it can become. Note, that COUNT(DISTINCT) takes second argument, which trades memory for accuracy, i.e. it will use more memory, but be more accurate. For example:

SELECT COUNT(DISTINCT x, 100000) FROM T

will return fairly accurate results if total number of distict values is less than 100,000.

The exact algorithm for COUNT distinct estimate varies, but different variations have similar error estimate - about 1/SQRT(N), where N is the second argument. Default value is 1000, which corresponds to about 3% error. If bumped to 10000 it would be about 1% error.

Upvotes: 4

Related Questions