Jan Naidu

Reputation: 31

Can Google Dataflow generate Parquet files

Can Google Dataflow generate Parquet files as the output of an ETL transformation?

Input ---> Dataflow ---> Parquet files

Upvotes: 2

Views: 1074

Answers (2)

Steven Ensslen

Reputation: 1386

Cloud Dataflow has supported writing Parquet since the parquetio module was introduced in Apache Beam 2.10 in February 2019. From the docs (with imports and a placeholder output path filled in):

import apache_beam as beam
import pyarrow

# Output path prefix; the bucket/path here is a placeholder.
filename = 'gs://my-bucket/output/records'

with beam.Pipeline() as p:
  records = p | 'Read' >> beam.Create(
      [{'name': 'foo', 'age': 10}, {'name': 'bar', 'age': 20}]
  )
  _ = records | 'Write' >> beam.io.WriteToParquet(
      filename,
      pyarrow.schema(
          [('name', pyarrow.binary()), ('age', pyarrow.int64())]
      )
  )
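
To have Dataflow itself run the pipeline and produce the files, the same code can be pointed at the Dataflow runner. A minimal sketch, where the project id, region, and bucket are placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

# All values below are placeholders for your own project, region, and bucket.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
  ...  # same Read/Write steps as above, with filename pointing at a GCS path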

Upvotes: 4

jkff

Reputation: 17913

Cloud Dataflow does not have a built-in way of generating Parquet files, but based on a quick look at the Parquet API, it should be relatively easy to implement a custom file-based Dataflow sink that does this (see FileBasedSink in the Dataflow SDK).
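
A rough sketch of that idea, using the Beam Python SDK's filebasedsink module and pyarrow for illustration; the class name SimpleParquetSink, the one-row-group-per-shard simplification, and the output path in the usage comment are assumptions, not a production-ready sink:

import apache_beam as beam
import pyarrow
import pyarrow.parquet as pq
from apache_beam.io import filebasedsink
from apache_beam.io.filesystem import CompressionTypes


class SimpleParquetSink(filebasedsink.FileBasedSink):
  """Hypothetical minimal Parquet sink: one row group per output shard."""

  def __init__(self, file_path_prefix, schema):
    super().__init__(
        file_path_prefix,
        coder=None,  # records are written directly, not through a coder
        file_name_suffix='.parquet',
        mime_type='application/x-parquet',
        compression_type=CompressionTypes.UNCOMPRESSED)
    self._schema = schema

  def open(self, temp_path):
    # The base class opens the temporary shard file; wrap it in a ParquetWriter.
    self._file_handle = super().open(temp_path)
    self._buffer = []
    return pq.ParquetWriter(self._file_handle, self._schema)

  def write_record(self, writer, value):
    # Buffer dict records; a production sink would flush in row-group-sized batches.
    self._buffer.append(value)

  def close(self, writer):
    if self._buffer:
      columns = {name: [row[name] for row in self._buffer]
                 for name in self._schema.names}
      writer.write_table(pyarrow.Table.from_pydict(columns, schema=self._schema))
    writer.close()
    self._file_handle.close()


# Usage inside a pipeline (the output prefix is a placeholder):
#   records | beam.io.Write(SimpleParquetSink('gs://my-bucket/output/part', schema))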

Upvotes: 1
