Reputation: 359
I am trying to implement a fancy SQL query but am having trouble with trying to execute the join and count.
I have a very long table of data :
author | group | id |
daniel | group1| 118
adam | group2| 126
harry | group1| 221
daniel | group2| 323
daniel | group2| 122
daniel | group5| 322
harry | group1| 222
harry | group1| 225
... ...
I want my output to look like:
author1 | author2 | intersection | union
daniel | adam | 2 | 3
daniel | harry| 2 | 11
adam | harry| 0 | 10
where intersection is defined as the # of groups where author1 & author2 have in common, and union = # of groups author1 + author - intersection.
I think the proper way to do this is by
table a left join b table on a.group == b.group
but i can't figure out how to do the aggregate count.
Thanks enter code here
Upvotes: 0
Views: 1569
Reputation: 413
CREATE OR REPLACE FUNCTION public.jaccard_similarity(IN vector anyarray)
RETURNS double precision[]
LANGUAGE 'plpgsql'
AS $BODY$
BEGIN
RETURN(select ARRAY(
select(
select (SELECT COUNT(*) FROM (select vector1 INTERSECT select vector2) as intersect_elements)/(SELECT COUNT(*) FROM(select vector1 UNION select vector2) as union_elements) from unnest($1,"TOPIC_VECTOR") as t(vector1,vector2))
from public.tbl_topic)
as score);
END;
$BODY$;
ALTER FUNCTION public.jaccard_similarity(anyarray)
OWNER TO postgres;
COMMENT ON FUNCTION public.jaccard_similarity(anyarray)
IS 'this function is used for calculating a jaccard similarity of input vector with all vector in databse.';
one can use this function for referenenec. Thank you.
Upvotes: 0
Reputation: 173046
“Jumping In” because 1) still don't see any answer 2) saw author's related question with BigQuery Tag
So, theoretically, below query would make your task (using bigquery-samples.reddit.full table for below examples):
BigQuery Legacy SQL:
SELECT
a.author AS author1,
b.author AS author2,
SUM(a.subr = b.subr) AS count_intersection,
EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union
FROM
(SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a
CROSS JOIN
(SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b
WHERE a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
BigQuery Standard SQL:
WITH subrs AS (
SELECT author, subr
FROM `bigquery-samples.reddit.full`
GROUP BY 1, 2
)
SELECT
a.author AS author1,
b.author AS author2,
COUNTIF(a.subr = b.subr) AS count_intersection,
COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a
JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
If you will try to run them, you most likely to get below error
An internal error occurred and the request could not be completed
The reason is because each of those two queries produces about trillion rows as a result of join (see below stats). There are many ways to address this – below proposed way is to address this by tuning requirements. Do you really need to involved into algorithms light authors with let’s say just one or two subreddits? Or – do you really want to find similarity between those who have just very few comments in specific subreddits?
See below, how introducing extra limits helps in executing above queries (note: lines
is min limit count of entries per author per subr and subrs
is min limit of number of subr per user)
Below is version that actually produces result w/o any type of failure:
Standard SQL
WITH authors AS (
SELECT author FROM (
SELECT author, COUNT(1) AS subrs FROM (
SELECT author, subr, COUNT(1) AS lines
FROM `bigquery-samples.reddit.full`
GROUP BY 1, 2
HAVING lines > 1
)
GROUP BY author
HAVING subrs > 3
)
),
subrs AS (
SELECT author, subr
FROM `bigquery-samples.reddit.full`
WHERE author IN (SELECT author FROM authors)
GROUP BY 1, 2
)
SELECT
a.author AS author1,
b.author AS author2,
COUNTIF(a.subr = b.subr) AS count_intersection,
COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
In similar way you can adjust Legacy SQL to make it work
This might be not the best way - but at least gives some hope on such tasks to be able to easily run within BigQuery, w/o going to other workarounds
Upvotes: 1