Reputation: 249
I have a table student_record with two columns: studentId and result, where result is an array of (subject, score) structs. I need to do the analysis on each subject separately, and there are over 100 subjects. Right now I just loop through all the subjects, save the output of the following query to a dataframe for each one, and run the analysis on that:
SELECT studentId, res.subject, res.score
FROM student_record, UNNEST(result) res
WHERE res.subject = s  -- s is the current subject in the loop
This query can take a long time to finish (100+ subjects, 100 million students), and it needs to be run once per subject. Is there a better way to perform this task with parallel processing in BQ, e.g. run a single query and save the results into local files indexed by subject?
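For the "single query, one local file per subject" idea, one option is to pull the unfiltered query result into a dataframe once and split it locally with pandas. A minimal sketch, with a small synthetic DataFrame standing in for the BigQuery result (in practice it would come from something like `client.query(sql).to_dataframe()`):

```python
import pandas as pd

# Stand-in for the result of:
#   SELECT studentId, res.subject, res.score
#   FROM student_record, UNNEST(result) res
df = pd.DataFrame({
    "studentId": ["s1", "s1", "s2", "s2"],
    "subject":   ["math", "physics", "math", "physics"],
    "score":     [90, 85, 70, 95],
})

# One pass over the result, writing one file per subject.
for subject, group in df.groupby("subject"):
    group.to_csv(f"scores_{subject}.csv", index=False)
```

This avoids re-running the query per subject, but it still moves every row out of BigQuery, which is the expensive part at 100 million students.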
Upvotes: 0
Views: 686
Reputation: 3618
This query is very straightforward and should be pretty quick. If you are writing millions of rows to a dataframe, that is probably your bottleneck. I would consider one of the following approaches. First, do the aggregation in BigQuery itself, so only one small summary row per subject comes back:
with data as (
select studentId, res.subject, res.score
from student_record, unnest(result) res
)
select
subject,
count(distinct studentID) as student_count,
avg(score) as avg_score,
max(score) as max_score,
min(score) as min_score,
variance(score) as var_score,
stddev(score) as std_dev_score,
-- etc.
from data
group by subject;
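With the grouped query you get one summary row per subject (a hundred-odd rows instead of tens of millions), which trivially fits in a dataframe. A sketch of the same aggregation in pandas on synthetic data, just to show the shape of the result:

```python
import pandas as pd

# Synthetic flattened (studentId, subject, score) rows.
df = pd.DataFrame({
    "studentId": ["s1", "s2", "s3"],
    "subject":   ["math", "math", "math"],
    "score":     [80, 90, 100],
})

# Same per-subject summary the SQL query computes.
summary = df.groupby("subject").agg(
    student_count=("studentId", "nunique"),
    avg_score=("score", "mean"),
    max_score=("score", "max"),
    min_score=("score", "min"),
)
print(summary)
```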
Alternatively, create a flattened copy of the table clustered by subject, so that each per-subject query only scans the blocks for that subject:

create table dataset.student_record_clustered_by_subject
(
studentId string, -- or int depending on makeup of your column
subject string,
score int -- or decimal if you have decimal places
)
cluster by subject
as (
select studentId, res.subject, res.score
from student_record, unnest(result) res
);
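Against the clustered table, the per-subject lookup in your loop becomes much cheaper, since BigQuery prunes the clusters that don't match. For example (with 'math' as a placeholder subject value):

```sql
select studentId, score
from dataset.student_record_clustered_by_subject
where subject = 'math';
```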
Upvotes: 1