Reputation: 249
I have a table student_record with two columns: studentId and result, where result is an array of (subject, score) structs. I need to do the analysis on each subject separately, and there are over 100 subjects. Right now I just loop through all the subjects, save the output of the following query to a dataframe for each one, and run the analysis on that:
SELECT studentId, res.subject, res.score
FROM student_record, UNNEST(result) res
WHERE res.subject = s  -- s is the current subject in the loop
This query can take a long time to finish (100+ subjects, 100 million students), and it needs to be run once per subject. Is there a better way to perform this task with parallel processing in BQ, e.g. run a single query and save the results into local files indexed by subject?
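For the "single query, one local file per subject" idea, one option is to pull the unfiltered query result into a dataframe once and split it locally with pandas. A minimal sketch, with a small synthetic DataFrame standing in for the BigQuery result (in practice it would come from something like `client.query(sql).to_dataframe()`):

```python
import pandas as pd

# Stand-in for the result of:
#   SELECT studentId, res.subject, res.score
#   FROM student_record, UNNEST(result) res
df = pd.DataFrame({
    "studentId": ["s1", "s1", "s2", "s2"],
    "subject":   ["math", "physics", "math", "physics"],
    "score":     [90, 85, 70, 95],
})

# One pass over the result, writing one file per subject.
for subject, group in df.groupby("subject"):
    group.to_csv(f"scores_{subject}.csv", index=False)
```

This avoids re-running the query per subject, but it still moves every row out of BigQuery, which is the expensive part at 100 million students.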
Upvotes: 0
Views: 686
Reputation: 3618
This query is very straightforward and should be pretty quick. If you are writing millions of rows to a dataframe, that is probably your bottleneck. I would consider one of the following approaches. First, do the aggregation in BigQuery itself, so only one small summary row per subject comes back:
with data as (
select studentId, res.subject, res.score
from student_record, unnest(result) res
)
select
subject,
count(distinct studentID) as student_count,
avg(score) as avg_score,
max(score) as max_score,
min(score) as min_score,
variance(score) as var_score,
stddev(score) as std_dev_score,
-- etc.
from data
group by subject;
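With the grouped query you get one summary row per subject (a hundred-odd rows instead of tens of millions), which trivially fits in a dataframe. A sketch of the same aggregation in pandas on synthetic data, just to show the shape of the result:

```python
import pandas as pd

# Synthetic flattened (studentId, subject, score) rows.
df = pd.DataFrame({
    "studentId": ["s1", "s2", "s3"],
    "subject":   ["math", "math", "math"],
    "score":     [80, 90, 100],
})

# Same per-subject summary the SQL query computes.
summary = df.groupby("subject").agg(
    student_count=("studentId", "nunique"),
    avg_score=("score", "mean"),
    max_score=("score", "max"),
    min_score=("score", "min"),
)
print(summary)
```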
Alternatively, create a flattened copy of the table clustered by subject, so that each per-subject query only scans the blocks for that subject:

create table dataset.student_record_clustered_by_subject
(
studentId string, -- or int depending on makeup of your column
subject string,
score int -- or decimal if you have decimal places
)
cluster by subject
as (
select studentId, res.subject, res.score
from student_record, unnest(result) res
);
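Against the clustered table, the per-subject lookup in your loop becomes much cheaper, since BigQuery prunes the clusters that don't match. For example (with 'math' as a placeholder subject value):

```sql
select studentId, score
from dataset.student_record_clustered_by_subject
where subject = 'math';
```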
Upvotes: 1