SQL- jaccard similarity

My table looks as follows:

author | group 

daniel | group1,group2,group3,group4,group5,group8,group10
adam   | group2,group5,group11,group12
harry  | group1,group10,group15,group13,group15,group18
...
...

I want my output to look like:

author1 | author2 | intersection | union

daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam   | harry| 0 | 10

THANK YOU

Upvotes: 5

Views: 5523

Answers (3)

Laurent Candillier
Laurent Candillier

Reputation: 11

I propose this option that scales better:

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),

tempTable AS (
  SELECT author, grp
  FROM YourTable, UNNEST(SPLIT(grp)) as grp
),

intersection AS (
  SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
  FROM tempTable a 
  JOIN tempTable b
  USING (grp)
  WHERE a.author > b.author
  GROUP BY a.author, b.author
),

count_distinct_groups AS (
  SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
  FROM tempTable
  GROUP BY author
),

join_it AS (
  SELECT
    intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
  FROM
    intersection
  JOIN
    count_distinct_groups cg1
  ON
    intersection.author1 = cg1.author
  JOIN
    count_distinct_groups cg2
  ON
    intersection.author2 = cg2.author
)

SELECT
  *,
  count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
  intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
  join_it

A full cross join on Big Data (tens of thousands x millions) fails for too much shuffling while the second proposal takes hours to execute. That one takes minutes.

The consequence of this approach though is that pairs having no intersection will not appear, so it will be the responsibility of the process that uses it to handle IFNULL.

Last detail: the union on Daniel and Harry is 10 rather than 11 as group15 is repeated in the initial example.

Upvotes: 1

wingr
wingr

Reputation: 2630

Inspired by Mikhail Berlyant's second answer, here is essentially the same method reformatted for Presto (as another example for a different flavor of SQL). Again all credit to Mikhail for this one.

WITH
YourTable AS (
    SELECT
        'daniel' AS author,
        'group1,group2,group3,group4,group5,group8,group10' AS grp
    UNION ALL
    SELECT
        'adam' AS author,
        'group2,group5,group11,group12' AS grp
    UNION ALL
    SELECT
        'harry' AS author,
        'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
    SELECT
        author,
        SPLIT(grp, ',') AS grp
    FROM
        YourTable
)
SELECT
  a.author AS author1,
  b.author  AS author2,
  CARDINALITY(a.grp) AS count1,
  CARDINALITY(b.grp) AS count2,
  CARDINALITY(ARRAY_INTERSECT(a.grp, b.grp)) AS intersection_count,
  CARDINALITY(ARRAY_UNION(a.grp, b.grp)) AS union_count
FROM tempTable a
JOIN tempTable b ON a.author < b.author
;

Note that this will give slightly different counts for harry as well as the union_count as it only counts unique entries, e.g. harry has two group15 values, but only one will be counted:

 author1 | author2 | count1 | count2 | intersection_count | union_count
---------+---------+--------+--------+--------------------+-------------
 daniel  | harry   |      7 |      5 |                  2 |          10
 adam    | harry   |      4 |      5 |                  0 |           9
 adam    | daniel  |      4 |      7 |                  2 |           9

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

Try below (for BigQuery)

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.item=b.item) AS intersection, 
  EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author 
GROUP BY 1,2

Added solution for BigQuery Standard SQL

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp) AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author  AS author2,
  (SELECT COUNT(1) FROM a.grp) AS count1,
  (SELECT COUNT(1) FROM b.grp) AS count2,
  (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
  (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author

What I like about this one:

  • much simpler / friendlier code
  • no CROSS JOIN and extra GROUP BY needed

When/If try - make sure to uncheck Use Legacy SQL checkbox under Show Options

Upvotes: 5

Related Questions