Reputation: 662
I understand that BigQuery is providing an estimation of COUNT DISTINCT, but is there any information on how big the error is and what kind of parameters it depends on?
Thanks
Upvotes: 3
Views: 325
Reputation: 14014
The accuracy of COUNT DISTINCT estimation depends on real number of distict values. If it is small - the algorithm is pretty accurate (for small values it usually returns the exact value), but the bigger number of distinct values is - the less accurate it can become. Note, that COUNT(DISTINCT) takes second argument, which trades memory for accuracy, i.e. it will use more memory, but be more accurate. For example:
SELECT COUNT(DISTINCT x, 100000) FROM T
will return fairly accurate results if total number of distict values is less than 100,000.
The exact algorithm for COUNT distinct estimate varies, but different variations have similar error estimate - about 1/SQRT(N), where N is the second argument. Default value is 1000, which corresponds to about 3% error. If bumped to 10000 it would be about 1% error.
Upvotes: 4