Trevor M.
Trevor M.

Reputation: 43

Self join in big query is running very slowly, am I following best practices?

I'm creating a table of the number of overlapping commentators between Reddit subreddits via the following self join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit;

Typical queries I have made on this data set in Big Query finish very quickly (< 1 min) but this query has run for over an hour and still not finished. This data has 54,504,410 rows and 22 columns.

Am I missing obvious speed ups that I should be implementing to make this query run quickly? Thanks!

Upvotes: 4

Views: 5917

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

try below

SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

It does two things
First, it pre-aggregates data to avoid redundant joining
Second, it eliminates "potential outliers" - those authors who have just one post for subreddit. Of course, second item depends on your use case. But most likely it should be ok and thus resolving performance issue. if still slower than you expected - increase threshold to 2 or greater

Follow up: ... 22,545,850,104 ... seems incorrect ... Should it be SUM(t1.cnt+t2.cnt)?

Sure it is incorrect, but it is exactly what you would get if you were able to run your query in question!
And my hope was that you will be able to catch this!
So, I am glad that fixing “performance” issue – opened your eyes on issue with logic in your original query!

So, yes, obviously 22,545,850,104 is incorrect number.
So, instead of

    SUM(t1.cnt*t2.cnt) as NumOverlaps   

you should use simple

    SUM(1) as NumOverlaps as NumOverlaps   

This will give you result that would be equivalent of using

    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps   

in your original query

So, try below now:

SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit<t2.subreddit
GROUP BY t1.subreddit, t2.subreddit

Upvotes: 5

Related Questions