Reputation: 359
My table looks as follows:
author | group
daniel | group1,group2,group3,group4,group5,group8,group10
adam | group2,group5,group11,group12
harry | group1,group10,group15,group13,group15,group18
...
...
I want my output to look like:
author1 | author2 | intersection | union
daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam | harry| 0 | 10
THANK YOU
Upvotes: 5
Views: 5523
Reputation: 11
I propose this option that scales better:
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, grp
FROM YourTable, UNNEST(SPLIT(grp)) as grp
),
intersection AS (
SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
FROM tempTable a
JOIN tempTable b
USING (grp)
WHERE a.author > b.author
GROUP BY a.author, b.author
),
count_distinct_groups AS (
SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
FROM tempTable
GROUP BY author
),
join_it AS (
SELECT
intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
FROM
intersection
JOIN
count_distinct_groups cg1
ON
intersection.author1 = cg1.author
JOIN
count_distinct_groups cg2
ON
intersection.author2 = cg2.author
)
SELECT
*,
count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
join_it
A full cross join on Big Data (tens of thousands x millions) fails for too much shuffling while the second proposal takes hours to execute. That one takes minutes.
The consequence of this approach though is that pairs having no intersection will not appear, so it will be the responsibility of the process that uses it to handle IFNULL.
Last detail: the union on Daniel and Harry is 10 rather than 11 as group15 is repeated in the initial example.
Upvotes: 1
Reputation: 2630
Inspired by Mikhail Berlyant's second answer, here is essentially the same method reformatted for Presto (as another example for a different flavor of SQL). Again all credit to Mikhail for this one.
WITH
YourTable AS (
SELECT
'daniel' AS author,
'group1,group2,group3,group4,group5,group8,group10' AS grp
UNION ALL
SELECT
'adam' AS author,
'group2,group5,group11,group12' AS grp
UNION ALL
SELECT
'harry' AS author,
'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT
author,
SPLIT(grp, ',') AS grp
FROM
YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
CARDINALITY(a.grp) AS count1,
CARDINALITY(b.grp) AS count2,
CARDINALITY(ARRAY_INTERSECT(a.grp, b.grp)) AS intersection_count,
CARDINALITY(ARRAY_UNION(a.grp, b.grp)) AS union_count
FROM tempTable a
JOIN tempTable b ON a.author < b.author
;
Note that this will give slightly different counts for harry
as well as the union_count
as it only counts unique entries, e.g. harry
has two group15
values, but only one will be counted:
author1 | author2 | count1 | count2 | intersection_count | union_count
---------+---------+--------+--------+--------------------+-------------
daniel | harry | 7 | 5 | 2 | 10
adam | harry | 4 | 5 | 0 | 9
adam | daniel | 4 | 7 | 2 | 9
Upvotes: 0
Reputation: 173046
Try below (for BigQuery)
SELECT
a.author AS author1,
b.author AS author2,
SUM(a.item=b.item) AS intersection,
EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author
GROUP BY 1,2
Added solution for BigQuery Standard SQL
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, SPLIT(grp) AS grp
FROM YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
(SELECT COUNT(1) FROM a.grp) AS count1,
(SELECT COUNT(1) FROM b.grp) AS count2,
(SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
(SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author
What I like about this one:
When/If try - make sure to uncheck Use Legacy SQL
checkbox under Show Options
Upvotes: 5