Reputation: 590
Im trying to play with the reddit data on bigquery and I want to see comments and replies in one row. I see bigquery supports subqueries, but I am unable to construct the query. I have to use a subquery to self join the same table because of the structure of the data, specifically i want to join id and parent_id together, but I need to modify id before I can join. Here is how im trying to do the query:
SELECT
p.subreddit,
p.body AS first_body,
p.score AS first_score,
CONCAT('t1_',p.id) AS first_id ,
c.last_body,
c.last_score,
c.last_id
FROM
[fh-bigquery:reddit_comments.2016_01] p,
(
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
WHERE p.first_id = c.parent_id
AND p.score > 1
AND p.author != '[deleted]'
AND p.body != '[deleted]';
The error I get is:
Field 'c.parent_id' not found in table 'fh-bigquery:reddit_comments.2016_01'; did you mean 'parent_id'?
Here is where you can run the query: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2016_01
Im not sure how to fix this. what is the proper way to join this and get this query to run?
Upvotes: 4
Views: 19289
Reputation: 172954
You might want to do something like below (just guess):
SELECT
p.subreddit,
p.body AS first_body,
p.score AS first_score,
CONCAT('t1_',p.id) AS first_id ,
c.last_body,
c.last_score,
c.last_id
FROM
[fh-bigquery:reddit_comments.2016_01] p
JOIN (
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
ON p.link_id = c.parent_id
WHERE p.score > 1
AND p.author != '[deleted]'
AND p.body != '[deleted]'
LIMIT 100
See more about JOINs
Please note, I just transformed your query to proper use JOINs, but logic of query is still for you to polish as you see needed
Added to address additional info in your comment:
SELECT
subreddit,
first_body,
first_score,
first_id ,
last_body,
last_score,
last_id
FROM (
SELECT
subreddit,
body AS first_body,
score AS first_score,
CONCAT('t1_',id) AS first_id
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE score > 1
AND author != '[deleted]'
AND body != '[deleted]'
) p
JOIN (
SELECT
body AS last_body,
score AS last_score,
CONCAT('t1_',id) AS last_id,
parent_id,
author,
body
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE body != '[deleted]'
AND author != '[deleted]'
AND score > 1
) c
ON p.first_id = c.parent_id
LIMIT 100
Upvotes: 6
Reputation: 7046
In BigQuery's SQL dialect, comma means UNION ALL instead of JOIN. You need to write the JOIN explicitly, using the JOIN keyword.
I would also recommend pushing both sides of the join into subqueries to ensure that all filters are applied before the join is executed. (The join is by far the most expensive part of the query, so applying the filters first will ensure that your query runs as fast as possible.)
Upvotes: 0