Reputation: 43
I'm creating a table of the number of overlapping commentators between Reddit subreddits via the following self join:
SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;
Typical queries I have made on this data set in Big Query finish very quickly (< 1 min) but this query has run for over an hour and still not finished. This data has 54,504,410 rows and 22 columns.
Am I missing obvious speed ups that I should be implementing to make this query run quickly? Thanks!
Upvotes: 4
Views: 5917
Reputation: 172993
try below
SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
It does two things
First, it pre-aggregates data to avoid redundant joining
Second, it eliminates "potential outliers" - those authors who have just one post for subreddit. Of course, second item depends on your use case. But most likely it should be ok and thus resolving performance issue. if still slower than you expected - increase threshold to 2 or greater
Follow up: ... 22,545,850,104 ... seems incorrect ... Should it be SUM(t1.cnt+t2.cnt)?
Sure it is incorrect, but it is exactly what you would get if you were able to run your query in question!
And my hope was that you will be able to catch this!
So, I am glad that fixing “performance” issue – opened your eyes on issue with logic in your original query!
So, yes, obviously 22,545,850,104 is incorrect number.
So, instead of
SUM(t1.cnt*t2.cnt) as NumOverlaps
you should use simple
SUM(1) as NumOverlaps as NumOverlaps
This will give you result that would be equivalent of using
EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps
in your original query
So, try below now:
SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt
FROM [fh-bigquery:reddit_comments.2015_05]
GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit
Upvotes: 5