Reputation: 319
I am working on Stackoverflow posts & comments extraction. There are three tables: questions, answers, and comments. The schemas for them are shown below, and you can also look at them in here.
Basically, I intend to retrieve all the comments, questions, and answers that share the same tags.
The questions schema:
tags
is the question tag collections looking like javascript|node.js|stream|buffer
in content.
The answers schema:
parent_id
is the corresponding question id for the answer.
The comment schema:
post_id
is its attached post id, regardless the post is a question or answer.
For instance, how to extract all the questions, answers, and comments that share tensorflow
tag? I appreciate your help!
The expected extracted tables look like this:
question.id|question.title|question.body|answer.id|answer.title|answer.body
and
question.id|question.title|question.body|comment.id|comment.text
and
answer.id|answer.title|answer.body|comment.id|comment.text
The fields are too many for all, so I just name a few to make things clear. For the first table, I already have it by SQL enquiry:
select *
from question
left outer join answer
on answer.parent_id = question.id
where question.tags like "%tensorflow%"
Upvotes: 0
Views: 1764
Reputation: 1945
There is some ways to do so. The is is that a question may have multiple answers, so you can return the question repeated with all its answers, or you can use arrays and structs to return each question once, with all its multiple answers.
left join
because you can't have an answer without a question):select q.tags question_tag, q.id question_id, q.title question_title, q.body question_body, a.id answer_id, a.body from `bigquery-public-data.stackoverflow.posts_questions` as q
left join
`bigquery-public-data.stackoverflow.posts_answers` as a
on a.parent_id = q.id
where regexp_contains(q.tags, 'java');
with agg_answer as (
select parent_id, ARRAY_AGG(struct(id,body)) from `bigquery-public-data.stackoverflow.posts_answers` group by parent_id
)
select q.tags question_tag, q.id question_id, q.title question_title, q.body question_body, a.*
from
`bigquery-public-data.stackoverflow.posts_questions` as q
left join
agg_answer as a
on q.id = a.parent_id
where regexp_contains(q.tags,'tensorflow');
question
x comments
:with agg_comments as (
select post_id, ARRAY_AGG(struct(id,text)) from `bigquery-public-data.stackoverflow.comments` group by post_id
)
select q.tags question_tag, q.id question_id, q.title question_title, q.body question_body, c.*
from
`bigquery-public-data.stackoverflow.posts_questions` as q
left join
agg_comments as c
on q.id = c.post_id
where regexp_contains(q.tags,'java' )
-- questions with or without comments ( remove follow line for all questions)
and c.post_id is not null;
select q.tags, q.id question_id, q.title question_title, q.body question_body, a.id answer_id, a.text comment_text from `bigquery-public-data.stackoverflow.posts_questions` as q
left join
`bigquery-public-data.stackoverflow.comments` as a
on a.post_id = q.id
where regexp_contains(q.tags, 'java')
-- questions with or without comments ( remove follow line for all questions)
and a.post_id is not null;
repeat for other tables you want.
Upvotes: 3