Reputation: 263
We are using the dataset write_dataset function in pyarrow to write arrow data to a base_dir ("/tmp") in parquet format. When the base_dir is empty, a part-0.parquet file is created. However, when writing new data to the same base_dir again, part-0.parquet is overwritten. I would expect to see a part-1.parquet with the new data in base_dir. Thanks
Upvotes: 2
Views: 2151
Reputation: 521
You can use a combination of the basename_template and existing_data_behavior parameters of ds.write_dataset() to generate a random or specific naming pattern for the partition files, so successive writes don't overwrite each other. For example:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2020, 2022, 2021],
... 'b': [3, 5, 7]})
>>> table = pa.Table.from_pandas(df)
>>>
>>> for k in range(3):
... ds.write_dataset(table, base_dir="dataset", format="parquet",
... partitioning=['a'],
... basename_template="part-{{i}}-{}.parquet".format(k),
... existing_data_behavior='overwrite_or_ignore' # you could also use 'delete_matching'
... )
... print(k)
... ds.dataset("dataset", format="parquet").files
...
0
['dataset/2020/part-0-0.parquet', 'dataset/2021/part-0-0.parquet', 'dataset/2022/part-0-0.parquet']
1
['dataset/2020/part-0-0.parquet', 'dataset/2020/part-0-1.parquet', 'dataset/2021/part-0-0.parquet', 'dataset/2021/part-0-1.parquet', 'dataset/2022/part-0-0.parquet', 'dataset/2022/part-0-1.parquet']
2
['dataset/2020/part-0-0.parquet', 'dataset/2020/part-0-1.parquet', 'dataset/2020/part-0-2.parquet', 'dataset/2021/part-0-0.parquet', 'dataset/2021/part-0-1.parquet', 'dataset/2021/part-0-2.parquet', 'dataset/2022/part-0-0.parquet', 'dataset/2022/part-0-1.parquet', 'dataset/2022/part-0-2.parquet']
>>>
Upvotes: 2