Reputation: 31
Can Google Dataflow generate Parquet files as the output of an ETL transformation?
Input ---> Dataflow -----> Parquet files
Upvotes: 2
Views: 1074
Reputation: 1386
Cloud Dataflow has supported writing Parquet since parquetio
was introduced in Apache Beam 2.10 in February 2019. From the docs:
import apache_beam as beam
import pyarrow

filename = 'records.parquet'  # or a gs:// path prefix when running on Dataflow

with beam.Pipeline() as p:
  records = p | 'Read' >> beam.Create(
      [{'name': 'foo', 'age': 10}, {'name': 'bar', 'age': 20}]
  )
  _ = records | 'Write' >> beam.io.WriteToParquet(
      filename,
      pyarrow.schema(
          [('name', pyarrow.binary()), ('age', pyarrow.int64())]
      )
  )
Upvotes: 4
Reputation: 17913
Cloud Dataflow does not have a built-in way of generating Parquet files, but based on a quick look at the Parquet API it should be relatively easy to implement a custom file-based Dataflow sink that does this (see FileBasedSink there).
Upvotes: 1