Self join in big query is running very slowly, am I following best practices?

Question

I'm creating a table of the number of overlapping commentators between Reddit subreddits via the following self join:

SELECT t1.subreddit, t2.subreddit, COUNT(*) as NumOverlaps
FROM [fh-bigquery:reddit_comments.2015_05] t1
JOIN [fh-bigquery:reddit_comments.2015_05] t2
ON t1.author=t2.author
WHERE t1.subreddit



Typical queries I have made on this data set in Big Query finish very quickly (< 1 min) but this query has run for over an hour and still not finished. This data has 54,504,410 rows and 22 columns. 

Am I missing obvious speed ups that I should be implementing to make this query run quickly? Thanks!

Mikhail Berlyant · Accepted Answer

try below

SELECT t1.subreddit, t2.subreddit, SUM(t1.cnt*t2.cnt) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit



It does two things

First, it pre-aggregates data to avoid redundant joining

Second, it eliminates "potential outliers" - those authors who have just one post for subreddit. Of course, second item depends on your use case. But most likely it should be ok and thus resolving performance issue. if still slower than you expected  - increase threshold to 2 or greater


  Follow up: ... 22,545,850,104 ...  seems incorrect ...
  Should it be SUM(t1.cnt+t2.cnt)?


Sure it is incorrect, but it is exactly what you would get if you were able to run your query in question!

And my hope was that you will be able to catch this!

So, I am glad that fixing “performance” issue – opened your eyes on issue with logic in your original query!

So, yes, obviously 22,545,850,104 is incorrect number.

So, instead of   

    SUM(t1.cnt*t2.cnt) as NumOverlaps   


you should use simple   

    SUM(1) as NumOverlaps as NumOverlaps   


This will give you result that would be equivalent of using   

    EXACT_COUNT_DISTINCT(t1.author) as NumOverlaps   


in your original query

So, try below now:    

SELECT t1.subreddit, t2.subreddit, SUM(1) as NumOverlaps
FROM (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t1
JOIN (SELECT subreddit, author, COUNT(1) as cnt 
      FROM [fh-bigquery:reddit_comments.2015_05] 
      GROUP BY subreddit, author HAVING cnt > 1) t2
ON t1.author=t2.author
WHERE t1.subreddit

Self join in big query is running very slowly, am I following best practices?

Answers (1)

Related Questions