Reputation: 1127
I have a 160GB database in BigQuery that holds events sent from apps in the last month. Each row has (among others) a date field, a userId field and an event field.
What I need to do is find how many unique users used the app on a given day. For example, if user X used the app three times on 2017/09/15, count him only once for that day, but if he used the app on 2017/09/15, 2017/09/16 and 2017/09/17, count him once on each of those days.
What I have tried is to group the rows by userId and by date, and then separately count the number of userIds for each date, for example:
SELECT
userId,
DATE(date) AS date_conv
FROM
[TABLE]
WHERE
userId IS NOT NULL
GROUP BY
date_conv,
userId
ORDER BY
userId
However, this query returns "Resources exceeded", even with an increased billing tier, so it would be great if there were a more efficient way.
Upvotes: 1
Views: 10836
Reputation: 1269633
If you want to count the number of unique users, then use COUNT(DISTINCT):
SELECT DATE(date) AS date_conv, COUNT(DISTINCT userId) as numusers
FROM [TABLE]
GROUP BY DATE(date)
ORDER BY DATE(date);
You are probably getting the "Resources exceeded" error because of the ORDER BY.
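As a quick sanity check of the counting logic, here is a minimal sketch that runs the same COUNT(DISTINCT) query on the example from the question. It uses Python's built-in sqlite3 module rather than BigQuery, and the table name `events` and the sample rows are made up for illustration, so it says nothing about how the query behaves at 160GB.

```python
import sqlite3

# Hypothetical toy data: user X opens the app three times on 2017-09-15
# and once on each of the next two days; user Y appears only on 2017-09-15.
events = [
    ("X", "2017-09-15 08:00:00"),
    ("X", "2017-09-15 12:30:00"),
    ("X", "2017-09-15 21:10:00"),
    ("X", "2017-09-16 09:00:00"),
    ("X", "2017-09-17 09:00:00"),
    ("Y", "2017-09-15 10:00:00"),
    (None, "2017-09-16 11:00:00"),  # NULL userId is ignored by COUNT(DISTINCT)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userId TEXT, date TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)

# One row per day, each user counted at most once per day.
daily_users = conn.execute(
    """
    SELECT DATE(date) AS date_conv, COUNT(DISTINCT userId) AS numusers
    FROM events
    GROUP BY date_conv
    ORDER BY date_conv
    """
).fetchall()

print(daily_users)
```

User X is counted once on each of the three days even though there are three events on the first day, and the row with a NULL userId does not add to the count.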
EDIT:
I haven't tried COUNT(DISTINCT) on a really large volume of data on BigQuery. In some databases, nested aggregations optimize much better:
SELECT date_conv, COUNT(*) as numusers
FROM (SELECT DATE(date) AS date_conv, userId
FROM [TABLE]
WHERE userId IS NOT NULL
GROUP BY DATE(date), userId
) du
GROUP BY date_conv
ORDER BY date_conv;
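The two-level aggregation can be checked against COUNT(DISTINCT) on a small table. The sketch below again uses Python's sqlite3 module with a made-up `events` table, not BigQuery. It assumes the outer query refers to the subquery's `date_conv` alias, and it filters out NULL userIds in the inner query, because COUNT(*) would otherwise count the NULL group as one more user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userId TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("X", "2017-09-15 08:00:00"),
        ("X", "2017-09-15 12:30:00"),
        ("Y", "2017-09-15 10:00:00"),
        ("X", "2017-09-16 09:00:00"),
        (None, "2017-09-16 11:00:00"),  # hypothetical row with no userId
    ],
)

# Inner query: one row per (day, user). Outer query: count those rows per day.
nested = conn.execute(
    """
    SELECT date_conv, COUNT(*) AS numusers
    FROM (SELECT DATE(date) AS date_conv, userId
          FROM events
          WHERE userId IS NOT NULL
          GROUP BY DATE(date), userId) du
    GROUP BY date_conv
    ORDER BY date_conv
    """
).fetchall()

# Reference result from the single-level COUNT(DISTINCT) query.
direct = conn.execute(
    """
    SELECT DATE(date) AS date_conv, COUNT(DISTINCT userId) AS numusers
    FROM events
    GROUP BY date_conv
    ORDER BY date_conv
    """
).fetchall()

print(nested)
print(nested == direct)
```

Both queries give the same per-day counts here; which one runs within BigQuery's resource limits on the real table would still have to be tested there.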
Upvotes: 4