Reputation: 3051
I am learning about Parquet files using Python and pyarrow. Parquet is great at compression and at minimizing disk space. My dataset is a 190 MB csv file which ends up as a single 3 MB file when saved as a snappy-compressed Parquet file.
However, when I save my dataset as partitioned files, they result in a much larger combined size (61 MB).
Here is example dataset that I am trying to save:
listing_id | date       | gender | price
-----------+------------+--------+------
a          | 2019-01-01 | M      | 100
b          | 2019-01-02 | M      | 100
c          | 2019-01-03 | F      | 200
d          | 2019-01-04 | F      | 200
When I partition by date (300+ unique values), the partitioned files come to 61 MB combined, with each file around 168.2 kB in size.
When I partition by gender (2 unique values), the partitioned files come to just 3 MB combined.
Is there a minimum file size for Parquet, such that many small files combined consume more disk space than a single file?
My env:
- OS: Ubuntu 18.04
- Language: Python
- Library: pyarrow, pandas
My dataset source:
https://www.kaggle.com/brittabettendorf/berlin-airbnb-data
# I am using calendar_summary.csv from the group of datasets at the link above
My code to save as parquet file:
# write to dataset using parquet
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table=table, where='./calendar_summary_write_table.parquet')

# parquet filesize
parquet_method1_filesize = os.path.getsize('./calendar_summary_write_table.parquet') / 1000
print('parquet_method1_filesize: %i kB' % parquet_method1_filesize)
My code to save as partitioned parquet file:
# write to dataset using parquet (partitioned)
import os
import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(
    table=table,
    root_path='./calendar_summary/',
    partition_cols=['date'])

# parquet filesize
print(os.popen('du -sh ./calendar_summary/').read())
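For a cross-check that does not depend on du, the combined size can also be computed in Python by walking the output directory and summing file sizes (a small sketch, assuming the same ./calendar_summary/ output path as above):
# sum the sizes of all files under the partitioned dataset directory
import os

total_bytes = 0
for dirpath, dirnames, filenames in os.walk('./calendar_summary/'):
    for filename in filenames:
        total_bytes += os.path.getsize(os.path.join(dirpath, filename))

print('partitioned dataset size: %i kB' % (total_bytes / 1000))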
Upvotes: 6
Views: 2857
Reputation: 3105
There is no minimum file size, but there is an overhead for storing the footer, and splitting wastes opportunities for optimization via encodings and compression. The various encodings and compression schemes build on the idea that the data has some amount of self-similarity which can be exploited by referencing back to earlier, similar occurrences. When you split the data into multiple files, each of them needs a separate "initial data point" that the subsequent ones can refer back to, so disk usage goes up. (Please note that this wording is a huge oversimplification, to avoid having to specifically go through the various techniques employed to save space, but see this answer for a few examples.)
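As a rough illustration of that overhead (not taken from the question's dataset; exact sizes will vary with pyarrow version and settings), one can write the same highly repetitive column once as a single file and once as many small files, then compare the totals:
# sketch: same data as one file vs. many small files
import os
import pyarrow as pa
import pyarrow.parquet as pq

# a highly repetitive column that dictionary/RLE encoding handles very well
table = pa.table({'price': [100, 200] * 500_000})

# one single Parquet file
pq.write_table(table, 'single.parquet')
single_size = os.path.getsize('single.parquet')

# the same rows spread over 1000 small Parquet files
os.makedirs('many', exist_ok=True)
many_size = 0
for i in range(1000):
    chunk = table.slice(i * 1000, 1000)
    path = os.path.join('many', 'part-%04d.parquet' % i)
    pq.write_table(chunk, path)
    many_size += os.path.getsize(path)

print('single file: %i kB' % (single_size / 1000))
print('many files : %i kB' % (many_size / 1000))  # expected to be much larger combined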
Another thing that can have a huge impact on the size of Parquet files is the order in which data is inserted. A sorted column can be stored a lot more efficiently than a randomly ordered one. It is possible that by partitioning the data you inadvertently alter its sort order. Another possibility is that you partition the data by the very attribute it was ordered by, which allowed a huge space saving when storing it in a single file, and this opportunity gets lost by splitting the data into multiple files. Finally, you have to keep in mind that Parquet is not optimized for storing a few kilobytes of data but for several megabytes or gigabytes (in a single file) or several petabytes (in multiple files).
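The sort-order effect can be checked in the same way; below is a small sketch (assuming numpy is available) that writes the same values once sorted and once shuffled and compares the resulting file sizes:
# sketch: sorted vs. shuffled column of the same values
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

values = np.repeat(np.arange(10_000), 100)   # sorted: long runs of equal values
shuffled = np.random.permutation(values)     # same values, random order

pq.write_table(pa.table({'x': values}), 'sorted.parquet')
pq.write_table(pa.table({'x': shuffled}), 'shuffled.parquet')

print('sorted  : %i kB' % (os.path.getsize('sorted.parquet') / 1000))
print('shuffled: %i kB' % (os.path.getsize('shuffled.parquet') / 1000))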
If you would like to inspect how your data is stored in your Parquet files, the Java implementation of Parquet includes the parquet-tools utility, which provides several commands. See its documentation page for building and getting started. More detailed descriptions of the individual commands are printed by parquet-tools itself. The commands most interesting to you are probably meta and dump.
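If setting up the Java tooling is inconvenient, pyarrow itself can already show much of the same information; a minimal sketch against the single-file output from the question:
# inspect Parquet metadata with pyarrow instead of parquet-tools
import pyarrow.parquet as pq

pf = pq.ParquetFile('./calendar_summary_write_table.parquet')
print(pf.metadata)                          # rows, row groups, schema summary
print(pf.schema)                            # Parquet schema
print(pf.metadata.row_group(0).column(0))   # encodings, compression, sizes of one column chunk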
Upvotes: 6