pa-nguyen
pa-nguyen

Reputation: 405

Apache beam dataframe write csv to GCS without shard name template

I have a Dataflow pipeline using Apache Beam dataframe, and I'd like to write the csv to a GCS bucket. This is my code:

with beam.Pipeline(options=pipeline_options) as p:
    df = p | read_csv(known_args.input)
    df[column] = df.groupby(primary_key)[column].apply(lambda x: x.ffill().bfill()))
    df.to_csv(known_args.output, index=False, encoding='utf-8')

However, while I pass a gcs path to known_args.output, the written csv on gcs is added with shard, like this gs://path/to/file-00000-of-00001. For my project, I need the file name to be without the shard. I've read the documentation but there seems to be no options to remove the shard. I tried converting the df back to pcollection and use WriteToText but it doesn't work either, and also not a desirable solution.

Upvotes: 1

Views: 1644

Answers (1)

robertwb
robertwb

Reputation: 5104

It looks like you're right; in Beam 2.40 there's no way to customize the sharding of these dataframe write operations. Instead, you'll have to use convert to a PCollection and use WriteToText(..., shard_name_template='')

I filed BEAM-22923. When the relevant PR is merged this fixed will allow one to pass an explicit file naming parameter (which will allow customization of this as well as windowing information), e.g.

df.to_csv(
    output_dir,
    num_shards=1,
    file_naming=fileio.single_file_naming('out.csv'))

.

Upvotes: 2

Related Questions