Jimmy Zhao
Jimmy Zhao

Reputation: 319

Querying Stackoverflow public dataset on BigQuery - SQL multiple join

I am working on Stackoverflow posts & comments extraction. There are three tables: questions, answers, and comments. The schemas for them are shown below, and you can also look at them in here.

Basically, I intend to retrieve all the comments, questions, and answers that share the same tags.

The questions schema:

enter image description here

tags is the question tag collections looking like javascript|node.js|stream|buffer in content.

The answers schema:

enter image description here

parent_id is the corresponding question id for the answer.

The comment schema:

enter image description here

post_id is its attached post id, regardless the post is a question or answer.

For instance, how to extract all the questions, answers, and comments that share tensorflow tag? I appreciate your help!

The expected extracted tables look like this:

question.id|question.title|question.body|answer.id|answer.title|answer.body

and

question.id|question.title|question.body|comment.id|comment.text

and

answer.id|answer.title|answer.body|comment.id|comment.text

The fields are too many for all, so I just name a few to make things clear. For the first table, I already have it by SQL enquiry:

select * 
from question
left outer join answer
on answer.parent_id = question.id
where question.tags like "%tensorflow%"

Upvotes: 0

Views: 1764

Answers (1)

ewertonvsilva
ewertonvsilva

Reputation: 1945

There is some ways to do so. The is is that a question may have multiple answers, so you can return the question repeated with all its answers, or you can use arrays and structs to return each question once, with all its multiple answers.


  • Easy way, same question multiple times (use left join because you can't have an answer without a question):
select q.tags question_tag, q.id question_id, q.title question_title, q.body question_body, a.id answer_id, a.body from `bigquery-public-data.stackoverflow.posts_questions` as q
left join
`bigquery-public-data.stackoverflow.posts_answers` as a
on a.parent_id = q.id
where regexp_contains(q.tags, 'java');
  • Using arrays and structs to return the question each time with all its answers:
with agg_answer as (
select parent_id, ARRAY_AGG(struct(id,body)) from `bigquery-public-data.stackoverflow.posts_answers` group by parent_id
)
select q.tags question_tag, q.id question_id, q.title question_title, q.body question_body, a.*
from 
`bigquery-public-data.stackoverflow.posts_questions` as q
left join 
agg_answer as a
on q.id = a.parent_id
where regexp_contains(q.tags,'tensorflow');

  • same for question x comments:

with agg_comments as (
select post_id, ARRAY_AGG(struct(id,text)) from `bigquery-public-data.stackoverflow.comments` group by post_id
)
select q.tags question_tag, q.id question_id, q.title question_title, q.body question_body, c.*
from 
`bigquery-public-data.stackoverflow.posts_questions` as q
left join 
agg_comments as c
on q.id = c.post_id
where regexp_contains(q.tags,'java' )
-- questions with or without comments ( remove follow line for all questions)
and c.post_id is not null;
select q.tags, q.id question_id, q.title question_title, q.body question_body, a.id answer_id, a.text comment_text from `bigquery-public-data.stackoverflow.posts_questions` as q
left join
`bigquery-public-data.stackoverflow.comments` as a
on a.post_id = q.id
where regexp_contains(q.tags, 'java')
-- questions with or without comments ( remove follow line for all questions)
and a.post_id is not null;


repeat for other tables you want.

Upvotes: 3

Related Questions