Reputation: 23
I am not able to figure out the precise functions in the GCP Dataflow Python SDK that read from and write to CSV files (or any non-text files, for that matter). For BigQuery, I have figured out the following functions:
beam.io.Read(beam.io.BigQuerySource('%Table_ID%'))
beam.io.Write(beam.io.BigQuerySink('%Table_ID%'))
For text files, I know of the ReadFromText and WriteToText functions.
However, I cannot find any examples for the GCP Dataflow Python SDK in which data is written to or read from CSV files. Could you please provide the GCP Dataflow Python SDK functions for reading from and writing to CSV files, in the same manner as the BigQuery functions above?
Upvotes: 2
Views: 2500
Reputation: 76
There is a CsvFileSource in the beam_utils PyPI package that reads .csv files, handles file headers, and supports custom delimiters. There is more information on how to use this source in this answer. Hope that helps!
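If it helps to picture the output, here is a rough pure-Python sketch of the record shape CsvFileSource yields: one dict per row, keyed by the header line, much like the standard library's csv.DictReader. The field names below are made up for illustration, and the Beam wiring itself (roughly beam.io.Read(CsvFileSource('gs://bucket/file.csv'))) is assumed rather than shown:

```python
import csv
import io

# Illustrative data only; CsvFileSource reads real files and emits
# one dict per data row, with keys taken from the header line.
sample = "name,city\nAlice,Paris\nBob,Berlin\n"
rows = list(csv.DictReader(io.StringIO(sample)))

rows[0]  # {'name': 'Alice', 'city': 'Paris'}
```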
Upvotes: 3
Reputation: 11041
CSV files are text files. The simplest (though somewhat inelegant) way of reading them would be to do a ReadFromText, and then split the lines read on the commas (e.g. beam.Map(lambda x: x.split(','))).
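One caveat with the naive split: it breaks on quoted fields that themselves contain commas. A minimal sketch of a safer Map function, using the standard csv module (the surrounding Beam pipeline is assumed, not shown):

```python
import csv
import io

def parse_csv_line(line):
    # csv.reader copes with quoted fields containing commas,
    # which a plain line.split(',') would split incorrectly.
    return next(csv.reader(io.StringIO(line)))

# In a pipeline this would be applied as, e.g.:
#   lines | beam.Map(parse_csv_line)
parse_csv_line('a,"b,c",d')  # ['a', 'b,c', 'd'] -- three fields, not four
```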
For a more elegant option, check out this question, or simply install the beam_utils package from pip and use the beam_utils.sources.CsvFileSource source to read from.
Upvotes: 1