CHP

Reputation: 950

Count rows that don't match between two tables within a group

I have two tables, each with user_id and group_name columns.

E.g.

table1:

| user_id | group_name1|
------------------------
|    1    |   'groupA' |
|    1    |   'groupB' |
|    2    |   'groupA' |
|    1    |   'groupA' |
------------------------


table2:

| user_id | group_name2|
------------------------
|    1    |   'groupL' |
|    2    |   'groupL' |
|    3    |   'groupL' |
|    4    |   'groupN' |
|    1    |   'groupN' |
|    3    |   'groupN' |
------------------------

I'm trying to build a distribution of how many times each user in table2 shows up in table1, computed within each pair of groups (one group from table1, one from table2).

For the example above, I'd get

| times_show_up | number_of_users | group_name1 | group_name2 |
---------------------------------------------------------------
|      0        |       1         |    groupA   |    groupL   |
|      1        |       1         |    groupA   |    groupL   |
|      2        |       1         |    groupA   |    groupL   |
|      0        |       2         |    groupB   |    groupL   |
|      1        |       1         |    groupB   |    groupL   |
|      2        |       0         |    groupB   |    groupL   |
|      0        |       2         |    groupA   |    groupN   |
|      1        |       0         |    groupA   |    groupN   |
|      2        |       1         |    groupA   |    groupN   |
|      0        |       2         |    groupB   |    groupN   |
|      1        |       1         |    groupB   |    groupN   |
|      2        |       0         |    groupB   |    groupN   |
---------------------------------------------------------------

To explain some rows as a further example:

For row 1, the user with user_id = 3 in groupL shows up 0 times in groupA. For row 2, the user with user_id = 2 in groupL shows up once in groupA. And for row 3, the user with user_id = 1 in groupL shows up twice in groupA.

Although in this example a user shows up at most 2 times, in the real data the maximum is some arbitrarily large number that I don't know ahead of time.

The other group combinations follow the same pattern, assuming I filled the table out correctly.
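As a sanity check on the table above, here is a minimal Python sketch that computes the same distribution directly from the sample rows. It is only meant for verifying small examples; the function name `distribution` is made up for illustration, and it omits combinations where number_of_users is 0.

```python
from collections import Counter

# Sample rows copied from the question: (user_id, group_name)
table1 = [(1, "groupA"), (1, "groupB"), (2, "groupA"), (1, "groupA")]
table2 = [(1, "groupL"), (2, "groupL"), (3, "groupL"),
          (4, "groupN"), (1, "groupN"), (3, "groupN")]


def distribution(table1, table2):
    """Return {(group_name1, group_name2, times_show_up): number_of_users}."""
    # Number of rows each (user_id, group_name1) pair has in table1.
    appearances = Counter(table1)
    groups1 = sorted({group for _, group in table1})
    result = Counter()
    # Each distinct (user_id, group_name2) pair in table2 is counted once.
    for user_id, group2 in set(table2):
        for group1 in groups1:
            # A missing key in a Counter reads as 0, which gives the zero rows.
            result[(group1, group2, appearances[(user_id, group1)])] += 1
    return dict(result)


dist = distribution(table1, table2)
for key in sorted(dist):
    print(key, dist[key])
```

On the sample data this yields nine non-empty combinations, which agree with the rows of the table that have a non-zero number_of_users.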

I've come up with a query that can do all of this except count the 0s. It looks like this:

    SELECT 
        COUNT(user_id) AS num_users,
        times_show_up,
        group_name1,
        group_name2
    FROM
    (
        SELECT 
            user_id, 
            COUNT(*) AS times_show_up,
            group_name1, 
            group_name2
        FROM
            table1
        RIGHT JOIN
            (SELECT DISTINCT user_id, group_name2 FROM table2)
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2

Unfortunately this won't return the 0 counts in the times_show_up column, and I haven't figured out a solution that can accomplish this without a lot of subqueries. One possible method is to just run subqueries to grab all the 0s for all combinations of all groups, and then just UNION those rows to the rest of the table. But I want to avoid approaches that include subqueries for each possible group1, group2 combination since the number of groups is very large.
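To see where the zeros go missing, the inner query can be run on the sample data. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption made only so the example is runnable) and writes the RIGHT JOIN as the equivalent LEFT JOIN from table2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
    CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
    INSERT INTO table1 VALUES (1, 'groupA'), (1, 'groupB'), (2, 'groupA'), (1, 'groupA');
    INSERT INTO table2 VALUES (1, 'groupL'), (2, 'groupL'), (3, 'groupL'),
                              (4, 'groupN'), (1, 'groupN'), (3, 'groupN');
""")

# Inner query of the attempt above; LEFT JOIN from table2 keeps the same rows
# as table1 RIGHT JOIN table2.
inner_rows = conn.execute("""
    SELECT t2.user_id, t1.group_name1, t2.group_name2, COUNT(*) AS times_show_up
    FROM (SELECT DISTINCT user_id, group_name2 FROM table2) AS t2
    LEFT JOIN table1 AS t1 ON t1.user_id = t2.user_id
    GROUP BY t2.user_id, t1.group_name1, t2.group_name2
    ORDER BY t2.group_name2, t2.user_id, t1.group_name1
""").fetchall()

for row in inner_rows:
    print(row)
```

On the sample data every row has times_show_up of at least 1. A user with no match at all (user 3 in groupL, for example) gets a single row with a NULL group_name1 and a count of 1, and a user who is missing from only one group (user 2 and groupB) gets no row for that group, so there is nothing left to count as 0.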

Some limitations: PARTITION BY on this data set tends to run out of memory, so I want to avoid it. Updated requirement: in addition, using CROSS JOIN at the individual user level (cross joining table1 directly to table2 without grouping rows first) doesn't work because each table has 10s of millions of rows.

Finally, rows with 0s in the number_of_users column don't have to show up. It's fine if they do, since they can be removed with a simple WHERE, but they are not necessary if leaving them out helps with the query.


UPDATE:

I was able to come up with a query that generates the zeros while requiring only one subquery for each group_name1, rather than one for each group_name1, group_name2 combination. I'm adding it to the question in case it helps someone come up with an answer that needs even fewer queries, since table1 could still contain 20 or more groups, which means 20 or more queries added via UNION ALL.

    SELECT * FROM
    (SELECT 
        times_show_up,
        COUNT(user_id) AS num_users,
        group_name1,
        group_name2
    FROM
    (
        SELECT 
            user_id, 
            COUNT(*) AS times_show_up,
            group_name1, 
            group_name2
        FROM
            table1
        INNER JOIN
            (SELECT DISTINCT user_id, group_name2 FROM table2) t2
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
    ) t1
    GROUP BY times_show_up, group_name1, group_name2) t9
    UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupA' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupA') t1
     USING(user_id)
     GROUP BY group_name2)
     UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupB' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupB') t1
     USING(user_id)
     GROUP BY group_name2)
     --- ORDER BY group_name1, group_name2, times_show_up

Upvotes: 1

Views: 790

Answers (3)

CHP

Reputation: 950

The answer by @Mikhail Berlyant meets the original requirements of my question. Unfortunately, because it relies on a cross join at the user_id level and there are tens of millions of user IDs, it takes too long to complete for my specific use case. So I'm providing the following answer, which is faster but requires an additional query for each group in table1 (though not for each combination of group1 and group2). That makes for a less concise query, one that can exceed BigQuery's query size limit if the number of groups is very large.

This approach is preferable if you can generate the per-group queries programmatically and have relatively few groups with millions of users each. The answer by @Mikhail Berlyant should work well when there are many more groups with a small number of users each, or when the queries are not generated programmatically and you would have to write one by hand for each group.

    SELECT * FROM
    (SELECT 
        times_show_up,
        COUNT(user_id) AS num_users,
        group_name1,
        group_name2
    FROM
    (
        SELECT 
            user_id, 
            COUNT(*) AS times_show_up,
            group_name1, 
            group_name2
        FROM
            table1
        INNER JOIN
            (SELECT DISTINCT user_id, group_name2 FROM table2) t2
        USING(user_id)
        GROUP BY user_id, group_name1, group_name2
    ) t1
    GROUP BY times_show_up, group_name1, group_name2) t9
    # Each subsequent query being UNIONed corresponds to a group in table 1
    UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupA' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupA') t1
     USING(user_id)
     GROUP BY group_name2)
     UNION ALL
    (SELECT
       0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL 
           THEN 1 ELSE 0 END) AS num_users,
       'groupB' AS group_name1,
       group_name2
     FROM
       table2
     LEFT JOIN
       (SELECT user_id FROM table1 WHERE group_name1 = 'groupB') t1
     USING(user_id)
     GROUP BY group_name2)
     --- ORDER BY group_name1, group_name2, times_show_up
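For the programmatic generation mentioned above, a rough Python sketch could look like the following. The helper names `zero_branch` and `build_query` are hypothetical, the branches are emitted without the surrounding parentheses, and the generated SQL is run against Python's built-in sqlite3 module rather than BigQuery, purely so the example can be checked on the sample data. Group names are interpolated into the SQL text, so the sketch refuses names containing quotes or backslashes instead of guessing at an escaping rule.

```python
import sqlite3

# Branch for times_show_up >= 1, same logic as the first SELECT in the query above.
NONZERO_BRANCH = """
SELECT times_show_up, COUNT(user_id) AS num_users, group_name1, group_name2
FROM (
    SELECT user_id, COUNT(*) AS times_show_up, group_name1, group_name2
    FROM table1
    INNER JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) t2
    USING (user_id)
    GROUP BY user_id, group_name1, group_name2
) t1
GROUP BY times_show_up, group_name1, group_name2"""


def zero_branch(group_name1):
    """Build the times_show_up = 0 branch for one group in table1."""
    if "'" in group_name1 or "\\" in group_name1:
        raise ValueError(f"group name would need escaping: {group_name1!r}")
    return f"""
SELECT 0 AS times_show_up,
       SUM(CASE WHEN t1.user_id IS NULL THEN 1 ELSE 0 END) AS num_users,
       '{group_name1}' AS group_name1,
       group_name2
FROM table2
LEFT JOIN (SELECT user_id FROM table1 WHERE group_name1 = '{group_name1}') t1
  ON t1.user_id = table2.user_id
GROUP BY group_name2"""


def build_query(group_names):
    """Join the non-zero branch and one zero branch per group with UNION ALL."""
    branches = [NONZERO_BRANCH] + [zero_branch(name) for name in group_names]
    return "\nUNION ALL".join(branches)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
    CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
    INSERT INTO table1 VALUES (1, 'groupA'), (1, 'groupB'), (2, 'groupA'), (1, 'groupA');
    INSERT INTO table2 VALUES (1, 'groupL'), (2, 'groupL'), (3, 'groupL'),
                              (4, 'groupN'), (1, 'groupN'), (3, 'groupN');
""")

# The list of groups comes from the data, so new groups need no manual edits.
groups = [row[0] for row in conn.execute(
    "SELECT DISTINCT group_name1 FROM table1 ORDER BY group_name1")]
query = build_query(groups)
rows = conn.execute(query).fetchall()
for row in sorted(rows, key=lambda r: (r[3], r[2], r[0])):
    print(row)
```

The generated text still grows by one branch per group in table1, so the query size concern described above applies to this sketch as well.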

Upvotes: 0

Mikhail Berlyant

Reputation: 173161

The query below is for BigQuery Standard SQL and ended up being relatively simple:

#standardSQL
SELECT times_show_up, 
  COUNT(DISTINCT user_id) number_of_users, 
  group_name1, group_name2
FROM (
  SELECT COUNTIF(a.user_id = b.user_id) times_show_up, 
    b.user_id, 
    group_name1, group_name2
  FROM table1 a
  CROSS JOIN table2 b
  GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2
-- ORDER BY group_name2, group_name1, times_show_up   

Applied to the sample data from your question, the result is:

Row times_show_up   number_of_users group_name1 group_name2  
1   0               1               groupA      groupL   
2   1               1               groupA      groupL   
3   2               1               groupA      groupL   
4   0               2               groupB      groupL   
5   1               1               groupB      groupL   
6   0               2               groupA      groupN   
7   2               1               groupA      groupN   
8   0               2               groupB      groupN   
9   1               1               groupB      groupN     

... rows with 0s in the number_of_users column don't have to show up

Note: I followed this rule, since it looks like you plan to filter such rows out anyway if the result contains them.

Update for ... each table has 10s of millions of rows.

Try the "optimized" version below:

#standardSQL
SELECT times_show_up, 
  COUNT(DISTINCT user_id) number_of_users, 
  group_name1, group_name2
FROM (
  SELECT SUM(IF(a.user_id = b.user_id, cnt, 0)) times_show_up, 
    b.user_id, 
    group_name1, group_name2
  FROM (SELECT user_id, group_name1, COUNT(1) cnt FROM table1 GROUP BY user_id, group_name1) a
  CROSS JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) b
  GROUP BY user_id, group_name1, group_name2
)
GROUP BY times_show_up, group_name1, group_name2

I don't have relevant data to test with, though, so I can't say whether this will help for your specific data.
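For correctness at least, both versions can be compared on the sample data from the question. The sketch below ports them to SQLite through Python's built-in sqlite3 module, which is an assumption made only to keep the check runnable: COUNTIF and IF are BigQuery functions, so they are rewritten here as SUM(CASE ...). It says nothing about performance on large tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
    CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
    INSERT INTO table1 VALUES (1, 'groupA'), (1, 'groupB'), (2, 'groupA'), (1, 'groupA');
    INSERT INTO table2 VALUES (1, 'groupL'), (2, 'groupL'), (3, 'groupL'),
                              (4, 'groupN'), (1, 'groupN'), (3, 'groupN');
""")

# Port of the first query: row-level cross join, COUNTIF written as SUM(CASE ...).
simple_sql = """
    SELECT times_show_up, COUNT(DISTINCT user_id) AS number_of_users,
           group_name1, group_name2
    FROM (
        SELECT SUM(CASE WHEN a.user_id = b.user_id THEN 1 ELSE 0 END) AS times_show_up,
               b.user_id AS user_id, a.group_name1, b.group_name2
        FROM table1 a
        CROSS JOIN table2 b
        GROUP BY b.user_id, a.group_name1, b.group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
"""

# Port of the "optimized" query: both sides are pre-aggregated before the cross join.
optimized_sql = """
    SELECT times_show_up, COUNT(DISTINCT user_id) AS number_of_users,
           group_name1, group_name2
    FROM (
        SELECT SUM(CASE WHEN a.user_id = b.user_id THEN a.cnt ELSE 0 END) AS times_show_up,
               b.user_id AS user_id, a.group_name1, b.group_name2
        FROM (SELECT user_id, group_name1, COUNT(1) AS cnt
              FROM table1 GROUP BY user_id, group_name1) a
        CROSS JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) b
        GROUP BY b.user_id, a.group_name1, b.group_name2
    )
    GROUP BY times_show_up, group_name1, group_name2
"""

simple_rows = set(conn.execute(simple_sql).fetchall())
optimized_rows = set(conn.execute(optimized_sql).fetchall())
print(simple_rows == optimized_rows)
for row in sorted(optimized_rows, key=lambda r: (r[3], r[2], r[0])):
    print(row)
```

On the sample data the two ports return the same nine rows as the result table shown earlier in this answer.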

Upvotes: 1

Gordon Linoff

Reputation: 1270763

Here is the strategy.

  1. Use cross join to generate the rows.
  2. For this purpose, get the groups using select distinct.
  3. Generate the times_show_up using a derived table.
  4. Aggregate table1 and table2.
  5. Join this all together.

Here is the query:

select g1.group_name1, g2.group_name2, tsu.times_show_up,
       coalesce(t12.cnt, 0) as num_users
from (select distinct group_name1 from table1) g1 cross join
     (select distinct group_name2 from table2) g2 cross join
     (select 0 as times_show_up union all
      select 1 union all
      select 2
     ) tsu left join
     (select t1.group_name1, t2.group_name2, count(*) as cnt
      from table1 t1 join
           table2 t2
           on t2.user_id = t1.user_id
      group by t1.group_name1, t2.group_name2
     ) t12
     on t12.group_name1 = g1.group_name1 and
        t12.group_name2 = g2.group_name2 and
        t12.cnt = tsu.times_show_up;

You may want count(distinct user_id) instead of count(*) in the subquery, if your data really does have duplicates.
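A variant of this strategy, sketched here rather than taken from the query above, counts appearances per user first and then derives the zero bucket by subtraction: users in a table2 group who never show up in a table1 group equal the group's total users minus the users who show up at least once. The cross join then only involves the distinct group names, and no hardcoded list of times_show_up values is needed. The CTE names are made up for illustration, and the SQL is run through Python's built-in sqlite3 module as a stand-in for BigQuery, so treat it as an untested-at-scale assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (user_id INTEGER, group_name1 TEXT);
    CREATE TABLE table2 (user_id INTEGER, group_name2 TEXT);
    INSERT INTO table1 VALUES (1, 'groupA'), (1, 'groupB'), (2, 'groupA'), (1, 'groupA');
    INSERT INTO table2 VALUES (1, 'groupL'), (2, 'groupL'), (3, 'groupL'),
                              (4, 'groupN'), (1, 'groupN'), (3, 'groupN');
""")

sql = """
WITH per_user AS (
    -- times each table2 user appears in each table1 group (always >= 1 here)
    SELECT t1.group_name1, t2.group_name2, t2.user_id, COUNT(*) AS times_show_up
    FROM table1 t1
    JOIN (SELECT DISTINCT user_id, group_name2 FROM table2) t2
      ON t2.user_id = t1.user_id
    GROUP BY t1.group_name1, t2.group_name2, t2.user_id
),
nonzero AS (
    -- number of users per times_show_up value
    SELECT group_name1, group_name2, times_show_up, COUNT(*) AS num_users
    FROM per_user
    GROUP BY group_name1, group_name2, times_show_up
),
group_sizes AS (
    -- total distinct users in each table2 group
    SELECT group_name2, COUNT(DISTINCT user_id) AS total_users
    FROM table2
    GROUP BY group_name2
),
zero AS (
    -- users who never show up = total users minus users seen at least once
    SELECT g1.group_name1, gs.group_name2, 0 AS times_show_up,
           gs.total_users - COALESCE(SUM(nz.num_users), 0) AS num_users
    FROM (SELECT DISTINCT group_name1 FROM table1) g1
    CROSS JOIN group_sizes gs
    LEFT JOIN nonzero nz
      ON nz.group_name1 = g1.group_name1 AND nz.group_name2 = gs.group_name2
    GROUP BY g1.group_name1, gs.group_name2, gs.total_users
)
SELECT group_name1, group_name2, times_show_up, num_users FROM nonzero
UNION ALL
SELECT group_name1, group_name2, times_show_up, num_users FROM zero
"""

rows = conn.execute(sql).fetchall()
for row in sorted(rows):
    print(row)
```

Combinations where the zero bucket is empty come back with num_users = 0 and can be filtered out with a WHERE if they are not wanted.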

Upvotes: 0
