Ahalya Hegde

Reputation: 1641

Chunk a large BigQuery response and save the chunks to CSV files using Apache Beam and Dataflow

I am new to Apache Beam and Dataflow. I am trying to fetch a large set of data, around 20,000 records. I need to chunk it into groups of 1,000 records and save each chunk to a separate CSV file. I know how to read from BigQuery and write to CSV, but I can't work out how to chunk the records using a Beam transform, or whether there is another way to do it.

What I tried: I started with the simple code below, where I pass the data read from BigQuery to a ParDo. I don't understand how to use ParDo to chunk the records; if this is not the right approach, please point me in the right direction.

Also, the ParDo is not printing the elements I pass to it in the code below.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class Printer(beam.DoFn):
    def process(self, element):
        print(element)
        yield element


def run():
    # The with-block runs the pipeline and waits for it to finish on exit.
    with beam.Pipeline() as p:
        (p
         | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
               query='SELECT email, name, age FROM `my_db`;',
               use_standard_sql=True)
         | "par" >> beam.ParDo(Printer())
         | "Print for now" >> beam.Map(print))


if __name__ == '__main__':
    run()

Thank you for any help.

Upvotes: 1

Views: 1055

Answers (1)

robertwb

Reputation: 5104

To write a CSV file you can use beam.io.WriteToText, preceded by a Map or DoFn that formats your elements into comma-delimited lines. If your data has a schema attached, you could also use the DataFrame API to write directly via the to_csv method.
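For example, a minimal sketch of the WriteToText route, if exact chunk sizes don't matter, could look like the following (the column names come from the query in the question; the output path and header are placeholders):

import apache_beam as beam

def to_csv_line(row):
  # ReadFromBigQuery yields each row as a dict keyed by column name.
  return ','.join(str(row[field]) for field in ('email', 'name', 'age'))

with beam.Pipeline() as p:
  (p
   | 'ReadFromBigQuery' >> beam.io.ReadFromBigQuery(
         query='SELECT email, name, age FROM `my_db`', use_standard_sql=True)
   | 'FormatAsCsv' >> beam.Map(to_csv_line)
   # WriteToText shards output across files based on worker parallelism,
   # not on a fixed record count.
   | 'WriteCsv' >> beam.io.WriteToText(
         '/some/path/out', file_name_suffix='.csv', header='email,name,age'))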

The sharding of the output files is determined by the sharding of the workers, which may be dynamic. If you need exactly 1000 records in each chunk, the only way to do this would be via a DoFn that writes things out manually, e.g.

def write_to_file(contents, prefix):
  # Derive a per-batch path from the batch contents.
  path = '%s-%d' % (prefix, hash(contents))
  # Write to a temporary path first, then rename, so readers never see
  # a partially written file. FileSystems.create expects bytes.
  with beam.io.filesystems.FileSystems.create(path + '.tmp') as fout:
    fout.write(contents.encode('utf-8'))
  beam.io.filesystems.FileSystems.rename([path + '.tmp'], [path])

(input_pcoll
 | beam.Map(lambda row: ','.join(str(s) for s in row))  # or similar
 # Group the formatted lines into batches of 1000 records
 # (the final batch may be smaller).
 | beam.BatchElements(min_batch_size=1000, max_batch_size=1000)
 | beam.Map(lambda lines: '\n'.join(lines))
 | beam.Map(write_to_file, '/some/path/out'))

Upvotes: 1
