Reputation: 950
I have two tables, each with user_id and group_name columns.
E.g.
table1:
| user_id | group_name1|
------------------------
| 1 | 'groupA' |
| 1 | 'groupB' |
| 2 | 'groupA' |
| 1 | 'groupA' |
------------------------
table2:
| user_id | group_name2|
------------------------
| 1 | 'groupL' |
| 2 | 'groupL' |
| 3 | 'groupL' |
| 4 | 'groupN' |
| 1 | 'groupN' |
| 3 | 'groupN' |
------------------------
I'm trying to build a distribution of how many times each user in table2 shows up in table1, computed separately for every combination of group_name1 and group_name2.
For the example above, I'd get
| times_show_up | number_of_users | group_name1 | group_name2 |
---------------------------------------------------------------
| 0 | 1 | groupA | groupL |
| 1 | 1 | groupA | groupL |
| 2 | 1 | groupA | groupL |
| 0 | 2 | groupB | groupL |
| 1 | 1 | groupB | groupL |
| 2 | 0 | groupB | groupL |
| 0 | 2 | groupA | groupN |
| 1 | 0 | groupA | groupN |
| 2 | 1 | groupA | groupN |
| 0 | 2 | groupB | groupN |
| 1 | 1 | groupB | groupN |
| 2 | 0 | groupB | groupN |
----------------------------------------------------------------
To explain a few rows:
For row 1, the user with user_id = 3 in groupL shows up 0 times in groupA. For row 2, the user with user_id = 2 in groupL shows up once in groupA. And for row 3, the user with user_id = 1 in groupL shows up twice in groupA. The rows for the other group combinations follow the same logic, assuming I filled them all out correctly.
Although in this example a user shows up at most 2 times, in the real data the maximum is arbitrarily large and I don't know it ahead of time.
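To pin down the expected output, here is a small reference sketch in plain Python (not SQL, and not meant to scale to millions of rows) that computes the same distribution for the sample tables. The function name `show_up_distribution` and the tuple layout are assumptions made for illustration only, and zero-user buckets are left out, which the question allows.

```python
from collections import Counter

def show_up_distribution(table1, table2):
    """Return {(times_show_up, group_name1, group_name2): number_of_users}.

    table1 and table2 are lists of (user_id, group_name) tuples.
    Rows of table2 are de-duplicated first; buckets with no users are omitted.
    """
    # How many times each user appears in each table1 group.
    appearances = Counter(table1)  # keys are (user_id, group_name1)
    groups1 = sorted({group for _, group in table1})
    result = Counter()
    for user_id, group2 in set(table2):
        for group1 in groups1:
            # A Counter returns 0 for users that never appear in group1.
            times = appearances[(user_id, group1)]
            result[(times, group1, group2)] += 1
    return dict(result)

table1 = [(1, "groupA"), (1, "groupB"), (2, "groupA"), (1, "groupA")]
table2 = [(1, "groupL"), (2, "groupL"), (3, "groupL"),
          (4, "groupN"), (1, "groupN"), (3, "groupN")]

distribution = show_up_distribution(table1, table2)
# Print in group_name2, group_name1, times_show_up order.
for key in sorted(distribution, key=lambda k: (k[2], k[1], k[0])):
    print(key, distribution[key])
```

Any SQL solution can be compared against this dictionary on a small sample before running it on the full tables.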
I've come up with a query that does all of this except counting the 0s. It looks like this:
SELECT
COUNT(user_id) AS num_users,
times_show_up,
group_name1,
group_name2
FROM
(
SELECT
user_id,
COUNT(*) AS times_show_up,
group_name1,
group_name2
FROM
table1
RIGHT JOIN
(SELECT DISTINCT user_id, group_name2 FROM table2)
USING(user_id)
GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2
Unfortunately this won't return the 0 counts in the times_show_up column, and I haven't figured out a solution that accomplishes this without a lot of subqueries. One possible method is to run subqueries that grab the 0s for every combination of groups, and then UNION those rows with the rest of the table. But I want to avoid approaches that need a subquery for each possible group1, group2 combination, since the number of groups is very large.
Some limitations: partition by on this data set tends to run out of memory, so I want to avoid it. Updated requirement: in addition, using CROSS JOIN at the individual user level (that is, cross joining table1 directly to table2 without grouping rows first) doesn't work, because each table has tens of millions of rows.
Finally, rows with 0s in the number_of_users column don't have to show up. It's fine if they do, since they can be removed with a simple WHERE, but they are not necessary if leaving them out helps with the query.
UPDATE:
I was able to come up with a query that generates the zeros while requiring only one subquery per group_name1, rather than one per group_name1, group_name2 combination. I'm adding it to the question in case it helps someone come up with an answer that needs fewer queries still, since table1 could have 20 or more groups, which means 20 or more queries combined via UNION ALL.
SELECT * FROM
(SELECT
times_show_up,
COUNT(user_id) AS num_users,
group_name1,
group_name2
FROM
(
SELECT
user_id,
COUNT(*) AS times_show_up,
group_name1,
group_name2
FROM
table1
INNER JOIN
(SELECT DISTINCT user_id, group_name2 FROM table2) t2
USING(user_id)
GROUP BY user_id, group_name1, group_name2
) t1
GROUP BY times_show_up, group_name1, group_name2) t9
UNION ALL
(SELECT
0 AS times_show_up,
SUM(CASE WHEN t1.user_id IS NULL
THEN 1 ELSE 0 END) AS num_users,
'groupA' AS group_name1,
group_name2
FROM
table2
LEFT JOIN
(SELECT user_id FROM table1 WHERE group_name1 = 'groupA') t1
USING(user_id)
GROUP BY group_name2)
UNION ALL
(SELECT
0 AS times_show_up,
SUM(CASE WHEN t1.user_id IS NULL
THEN 1 ELSE 0 END) AS num_users,
'groupB' AS group_name1,
group_name2
FROM
table2
LEFT JOIN
(SELECT user_id FROM table1 WHERE group_name1 = 'groupB') t1
USING(user_id)
GROUP BY group_name2)
--- ORDER BY group_name1, group_name2, times_show_up
Upvotes: 1
Views: 790
Reputation: 950
The answer by @Mikhail Berlyant meets the original requirements of my question. Unfortunately, because it relies on a cross join at the user_id level and there are tens of millions of user IDs, it takes too long to complete for my specific use case. So I'm providing the following answer, which is faster but requires an additional subquery for each group in table1 (though not for each combination of group1 and group2). That makes for a less concise query, which could exceed BigQuery's query size limit if the number of groups is very large.
This approach is preferable if you can generate the per-group subqueries programmatically and you have relatively few groups with millions of users each. The answer by @Mikhail Berlyant should work well when there are many more groups with a small number of users each, or when the query is not generated programmatically and you would have to write a subquery by hand for each group.
SELECT * FROM
(SELECT
times_show_up,
COUNT(user_id) AS num_users,
group_name1,
group_name2
FROM
(
SELECT
user_id,
COUNT(*) AS times_show_up,
group_name1,
group_name2
FROM
table1
INNER JOIN
(SELECT DISTINCT user_id, group_name2 FROM table2) t2
USING(user_id)
GROUP BY user_id, group_name1, group_name2
) t1
GROUP BY times_show_up, group_name1, group_name2) t9
# Each subsequent query being UNIONed corresponds to a group in table 1
UNION ALL
(SELECT
0 AS times_show_up,
SUM(CASE WHEN t1.user_id IS NULL
THEN 1 ELSE 0 END) AS num_users,
'groupA' AS group_name1,
group_name2
FROM
table2
LEFT JOIN
(SELECT user_id FROM table1 WHERE group_name1 = 'groupA') t1
USING(user_id)
GROUP BY group_name2)
UNION ALL
(SELECT
0 AS times_show_up,
SUM(CASE WHEN t1.user_id IS NULL
THEN 1 ELSE 0 END) AS num_users,
'groupB' AS group_name1,
group_name2
FROM
table2
LEFT JOIN
(SELECT user_id FROM table1 WHERE group_name1 = 'groupB') t1
USING(user_id)
GROUP BY group_name2)
--- ORDER BY group_name1, group_name2, times_show_up
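One possible way to drop the per-group branches entirely, sketched under assumptions: the zero bucket for a (group_name1, group_name2) pair equals the number of distinct users in group_name2 minus the number of those users that show up at least once in group_name1. It can therefore be derived by subtraction after aggregating, with a cross join only at the group level rather than the user level. The sketch below runs that idea against the sample data through Python's built-in sqlite3 module, purely so it can be executed locally. The CTE names (matched, nonzero, totals, matched_totals, zeros) are made up for illustration, and the query has not been run on BigQuery or measured on large tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
INSERT INTO table1 VALUES (1,'groupA'),(1,'groupB'),(2,'groupA'),(1,'groupA');
INSERT INTO table2 VALUES (1,'groupL'),(2,'groupL'),(3,'groupL'),
                          (4,'groupN'),(1,'groupN'),(3,'groupN');
""")

QUERY = """
WITH t2 AS (SELECT DISTINCT user_id, group_name2 FROM table2),
matched AS (   -- one row per user that shows up at least once
  SELECT t1.group_name1, t2.group_name2, t2.user_id,
         COUNT(*) AS times_show_up
  FROM table1 t1 INNER JOIN t2 ON t2.user_id = t1.user_id
  GROUP BY t1.group_name1, t2.group_name2, t2.user_id
),
nonzero AS (   -- the distribution for times_show_up >= 1
  SELECT times_show_up, COUNT(*) AS num_users, group_name1, group_name2
  FROM matched
  GROUP BY times_show_up, group_name1, group_name2
),
totals AS (    -- distinct users per table2 group
  SELECT group_name2, COUNT(*) AS total_users FROM t2 GROUP BY group_name2
),
matched_totals AS (
  SELECT group_name1, group_name2, SUM(num_users) AS matched_users
  FROM nonzero
  GROUP BY group_name1, group_name2
),
zeros AS (     -- zero bucket = all users minus users seen at least once
  SELECT 0 AS times_show_up,
         totals.total_users - COALESCE(m.matched_users, 0) AS num_users,
         g1.group_name1, totals.group_name2
  FROM (SELECT DISTINCT group_name1 FROM table1) g1
  CROSS JOIN totals
  LEFT JOIN matched_totals m
    ON m.group_name1 = g1.group_name1
   AND m.group_name2 = totals.group_name2
)
SELECT times_show_up, num_users, group_name1, group_name2 FROM nonzero
UNION ALL
SELECT times_show_up, num_users, group_name1, group_name2 FROM zeros
WHERE num_users > 0
ORDER BY group_name2, group_name1, times_show_up
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The only cross join here is between the distinct table1 groups and the per-group totals of table2, so its size depends on the number of groups, not on the number of users.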
Upvotes: 0
Reputation: 173161
The query below is for BigQuery Standard SQL and ended up being relatively simple:
#standardSQL
SELECT times_show_up,
COUNT(DISTINCT user_id) number_of_users,
group_name1, group_name2
FROM (
SELECT COUNTIF(a.user_id = b.user_id) times_show_up,
b.user_id,
group_name1, group_name2
FROM table1 a
CROSS JOIN table2 b
GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2
-- ORDER BY group_name2, group_name1, times_show_up
Applied to the sample data from your question, the result is:
Row times_show_up number_of_users group_name1 group_name2
1 0 1 groupA groupL
2 1 1 groupA groupL
3 2 1 groupA groupL
4 0 2 groupB groupL
5 1 1 groupB groupL
6 0 2 groupA groupN
7 2 1 groupA groupN
8 0 2 groupB groupN
9 1 1 groupB groupN
... rows with 0s in the number_of_users column don't have to show up
Note: I followed this rule, since it looks like you plan to eliminate such rows anyway if they appear in the result.
Update for "... each table has 10s of millions of rows":
Try the "optimized" version below:
#standardSQL
SELECT times_show_up,
COUNT(DISTINCT user_id) number_of_users,
group_name1, group_name2
FROM (
SELECT SUM(IF(a.user_id = b.user_id, cnt, 0)) times_show_up,
b.user_id,
group_name1, group_name2
FROM (SELECT user_id, group_name1, COUNT(1) cnt FROM table1 GROUP BY user_id, group_name1) a
CROSS JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) b
GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2
I don't have relevant data to test with, though, so I can't say whether this will help for your specific data.
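As a sanity check on the sample data, the sketch below runs both versions through Python's built-in sqlite3 module and compares their results and the number of row pairs each cross join produces. This is an adaptation, not the BigQuery queries themselves: COUNTIF and IF are rewritten as SUM(CASE ...) because SQLite lacks them, and the helper name `distribution_sql` is an assumption for illustration. It says nothing about runtime on tens of millions of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
INSERT INTO table1 VALUES (1,'groupA'),(1,'groupB'),(2,'groupA'),(1,'groupA');
INSERT INTO table2 VALUES (1,'groupL'),(2,'groupL'),(3,'groupL'),
                          (4,'groupN'),(1,'groupN'),(3,'groupN');
""")

# Left and right inputs of the cross join for each version.
RAW_A = "table1"
RAW_B = "table2"
PRE_A = ("(SELECT user_id, group_name1, COUNT(1) AS cnt "
         "FROM table1 GROUP BY user_id, group_name1)")
PRE_B = "(SELECT DISTINCT user_id, group_name2 FROM table2)"

def distribution_sql(a, b, weight):
    """Build the two-level aggregation; weight is what each match contributes."""
    return f"""
    SELECT times_show_up, COUNT(DISTINCT user_id) AS number_of_users,
           group_name1, group_name2
    FROM (
      SELECT SUM(CASE WHEN a.user_id = b.user_id THEN {weight} ELSE 0 END)
               AS times_show_up,
             b.user_id AS user_id, group_name1, group_name2
      FROM {a} a CROSS JOIN {b} b
      GROUP BY b.user_id, group_name1, group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
    ORDER BY group_name2, group_name1, times_show_up
    """

raw_rows = conn.execute(distribution_sql(RAW_A, RAW_B, "1")).fetchall()
preagg_rows = conn.execute(distribution_sql(PRE_A, PRE_B, "a.cnt")).fetchall()

# Number of row pairs each cross join has to produce.
raw_pairs = conn.execute(
    f"SELECT COUNT(*) FROM {RAW_A} a CROSS JOIN {RAW_B} b").fetchone()[0]
pre_pairs = conn.execute(
    f"SELECT COUNT(*) FROM {PRE_A} a CROSS JOIN {PRE_B} b").fetchone()[0]

print(raw_rows == preagg_rows, raw_pairs, pre_pairs)
```

Pre-aggregating shrinks the cross join only as much as table1 repeats (user_id, group_name1) pairs and table2 repeats rows, so the gain depends on how much duplication the real data contains.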
Upvotes: 1
Reputation: 1270763
Here is the strategy:
- Use cross join to generate the rows.
- Count with count(distinct).
- Generate times_show_up using a derived table.
- Join table1 and table2.
Here is the query:
select g1.group_name1, g2.group_name2, tsu.times_show_up,
coalesce(t12.cnt, 0) as num_users
from (select distinct group_name1 from table1) g1 cross join
(select distinct group_name2 from table2) g2 cross join
(select 0 as times_show_up union all
select 1 union all
select 2
) tsu left join
(select t1.group_name1, t2.group_name2, count(*) as cnt
from table1 t1 join
table2 t2
on t2.user_id = t1.user_id
group by t1.group_name1, t2.group_name2
) t12
on t12.group_name1 = g1.group_name1 and
t12.group_name2 = g2.group_name2 and
t12.cnt = tsu.times_show_up;
You may want count(distinct user_id) instead of count(*) in the subquery, if your data really does have duplicates.
Upvotes: 0