Nagaraju

Reputation: 23

Google Cloud Dataflow (Python): function to read from and write to a .csv file?

I am not able to figure out the precise functions in GCP Dataflow Python SDK that read from and write to csv files (or any non-txt files for that matter). For BigQuery, I have figured out the following functions:

beam.io.Read(beam.io.BigQuerySource('%Table_ID%'))
beam.io.Write(beam.io.BigQuerySink('%Table_ID%'))

For reading text files, the ReadFromText and WriteToText functions are known to me.

However, I cannot find any examples for the GCP Dataflow Python SDK in which data is read from or written to csv files. Could you please provide the GCP Dataflow Python SDK functions for reading from and writing to csv files, in the same manner as the BigQuery functions above?

Upvotes: 2

Views: 2500

Answers (2)

Flavio Fiszman

Reputation: 76

There is a CsvFileSource in the beam_utils PyPI package that reads .csv files, deals with file headers, and can use custom delimiters. There is more information on how to use this source in this answer. Hope that helps!

Upvotes: 3

Pablo

Reputation: 11041

CSV files are text files. The simplest (though somewhat inelegant) way of reading them would be to do a ReadFromText, and then split the lines read on the commas (e.g. beam.Map(lambda x: x.split(','))).
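As a sketch of that per-line parsing (plain Python, outside a pipeline): a bare split on commas breaks as soon as a quoted field contains a comma, so the standard csv module is a safer callable to hand to beam.Map. The field values below are made-up sample data.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line into a list of fields.

    This is the kind of callable you could pass to beam.Map after
    ReadFromText; csv.reader respects quoted fields, which a plain
    str.split(',') does not.
    """
    return next(csv.reader([line]))

# A naive split breaks on a quoted field containing a comma:
line = 'alice,"New York, NY",30'
print(line.split(','))       # ['alice', '"New York', ' NY"', '30']
print(parse_csv_line(line))  # ['alice', 'New York, NY', '30']
```

In a pipeline this would look like `lines | beam.Map(parse_csv_line)` instead of `beam.Map(lambda x: x.split(','))`.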

For the more elegant option, check out this question, or simply install the beam_utils package from PyPI and use beam_utils.sources.CsvFileSource as the source to read from.

Upvotes: 1
