Reputation: 646
Is it possible to write to the same Parquet folder from different processes in Python?
I use fastparquet.
It seems to work, but I'm wondering how the _metadata
file avoids conflicts when two processes write to it at the same time.
Also, to make it work I had to use ignore_divisions=True,
which is not ideal for fast read performance on the Parquet file later, right?
Upvotes: 2
Views: 808
Reputation: 28683
Dask consolidates the metadata from the separate processes, so that it only writes the _metadata
file once the rest is complete, and this happens in a single thread.
If you were writing separate parquet files to a single folder using your own multiprocessing setup, each process would typically write its own data file and no _metadata
at all. You could either gather the pieces in a single process like Dask does, or consolidate the metadata from the data files after they are all ready.
Upvotes: 2