SQL REDDIT - Jaccard Similarity

I am trying to implement a fancy SQL query but am having trouble with trying to execute the join and count.

I have a very long table of data :

author | group | id |

daniel | group1| 118
adam   | group2| 126
harry  | group1| 221
daniel | group2| 323
daniel | group2| 122
daniel | group5| 322
harry  | group1| 222 
harry  | group1| 225

... ...

I want my output to look like:

author1 | author2 | intersection | union

daniel | adam | 2 | 3
daniel | harry| 2 | 11
adam   | harry| 0 | 10

where intersection is defined as the # of groups where author1 & author2 have in common, and union = # of groups author1 + author - intersection.

I think the proper way to do this is by

table a left join b table on a.group == b.group

but i can't figure out how to do the aggregate count.

Thanks enter code here

Upvotes: 0

Views: 1569

Answers (2)

Gaurav Koradiya
Gaurav Koradiya

Reputation: 413

CREATE OR REPLACE FUNCTION public.jaccard_similarity(IN vector anyarray)
    RETURNS double precision[]
    LANGUAGE 'plpgsql'

AS $BODY$
BEGIN
    RETURN(select ARRAY(
            select(
                select (SELECT COUNT(*) FROM (select vector1 INTERSECT select vector2) as intersect_elements)/(SELECT COUNT(*) FROM(select vector1 UNION select vector2) as union_elements) from unnest($1,"TOPIC_VECTOR") as t(vector1,vector2)) 
            from public.tbl_topic) 
            as score);

END;
$BODY$;

ALTER FUNCTION public.jaccard_similarity(anyarray)
    OWNER TO postgres;

COMMENT ON FUNCTION public.jaccard_similarity(anyarray)
    IS 'this function is used for calculating a jaccard similarity of input vector with all vector in databse.';

one can use this function for referenenec. Thank you.

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

“Jumping In” because 1) still don't see any answer 2) saw author's related question with BigQuery Tag

So, theoretically, below query would make your task (using bigquery-samples.reddit.full table for below examples):

BigQuery Legacy SQL:

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.subr = b.subr) AS count_intersection,
  EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union
FROM 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a
CROSS JOIN 
  (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b
WHERE a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100

BigQuery Standard SQL:

WITH subrs AS (
  SELECT author, subr 
  FROM `bigquery-samples.reddit.full` 
  GROUP BY 1, 2
)
SELECT
  a.author AS author1, 
  b.author AS author2, 
  COUNTIF(a.subr = b.subr) AS count_intersection,
  COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a 
JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100

If you will try to run them, you most likely to get below error

An internal error occurred and the request could not be completed

The reason is because each of those two queries produces about trillion rows as a result of join (see below stats). There are many ways to address this – below proposed way is to address this by tuning requirements. Do you really need to involved into algorithms light authors with let’s say just one or two subreddits? Or – do you really want to find similarity between those who have just very few comments in specific subreddits?

See below, how introducing extra limits helps in executing above queries (note: lines is min limit count of entries per author per subr and subrs is min limit of number of subr per user)

enter image description here

Below is version that actually produces result w/o any type of failure:

Standard SQL

WITH authors AS (
  SELECT author FROM (
    SELECT author, COUNT(1) AS subrs FROM (
      SELECT author, subr, COUNT(1) AS lines 
      FROM `bigquery-samples.reddit.full` 
      GROUP BY 1, 2
      HAVING lines > 1
    ) 
    GROUP BY author
    HAVING subrs > 3
  )
),
subrs AS (
  SELECT author, subr 
  FROM `bigquery-samples.reddit.full` 
  WHERE author IN (SELECT author FROM authors)
  GROUP BY 1, 2
)
SELECT
  a.author AS author1, 
  b.author AS author2, 
  COUNTIF(a.subr = b.subr) AS count_intersection,
  COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100

In similar way you can adjust Legacy SQL to make it work

This might be not the best way - but at least gives some hope on such tasks to be able to easily run within BigQuery, w/o going to other workarounds

Upvotes: 1

Related Questions