Revital Eres

Reputation: 263

Using arrow write_dataset function to append parquet data

We are using the arrow dataset `write_dataset` function in pyarrow to write arrow data to a base_dir ("/tmp") in parquet format. When the base_dir is empty, a part-0.parquet file is created. However, when writing new data to the base_dir again, part-0.parquet is overwritten. I would expect to see a part-1.parquet with the new data in base_dir. Thanks

Upvotes: 2

Views: 2151

Answers (1)

alenka

Reputation: 521

You can use a combination of the `basename_template` and `existing_data_behavior` parameters of `ds.write_dataset()` to generate a random or otherwise unique pattern for the output filenames, so each write produces new files instead of overwriting the existing ones. For example:

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2020, 2022, 2021],
...                    'b': [3, 5, 7]})
>>> table = pa.Table.from_pandas(df)
>>> 
>>> for k in range(3):
...     ds.write_dataset(table, base_dir="dataset", format="parquet",
...                      partitioning=['a'],
...                      basename_template="part-{{i}}-{}.parquet".format(k),
...                      existing_data_behavior='overwrite_or_ignore' # you could also use 'delete_matching'
...                     )
...     print(k)
...     ds.dataset("dataset", format="parquet").files
... 
0
['dataset/2020/part-0-0.parquet', 'dataset/2021/part-0-0.parquet', 'dataset/2022/part-0-0.parquet']
1
['dataset/2020/part-0-0.parquet', 'dataset/2020/part-0-1.parquet', 'dataset/2021/part-0-0.parquet', 'dataset/2021/part-0-1.parquet', 'dataset/2022/part-0-0.parquet', 'dataset/2022/part-0-1.parquet']
2
['dataset/2020/part-0-0.parquet', 'dataset/2020/part-0-1.parquet', 'dataset/2020/part-0-2.parquet', 'dataset/2021/part-0-0.parquet', 'dataset/2021/part-0-1.parquet', 'dataset/2021/part-0-2.parquet', 'dataset/2022/part-0-0.parquet', 'dataset/2022/part-0-1.parquet', 'dataset/2022/part-0-2.parquet']
>>> 


Upvotes: 2
