Reputation: 24776
Using the finer-grained reading and writing example from the pyarrow documentation:

import pyarrow.parquet as pq

# 'table' is an existing pyarrow.Table
writer = pq.ParquetWriter('example2.parquet', table.schema)
for i in range(300000):
    writer.write_table(table)
# Is it possible to know how many bytes have been written out to the file after compression?
writer.close()
I want to limit the output file size to 200-300 MB. I tried writer.file_handle.size(), but that raises *** OSError: only valid on readable files. Is there a good way to limit the size of the output file?
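One possible way to approximate the compressed bytes written so far is to check the on-disk size of the partially written file; a minimal sketch (not from the original post), assuming a rough 250 MB target and noting that data still buffered by the writer is only counted once the file is closed:

import os

import pyarrow.parquet as pq

path = 'example2.parquet'
writer = pq.ParquetWriter(path, table.schema)
for i in range(300000):
    writer.write_table(table)
    # os.path.getsize reports the bytes flushed to disk so far; anything still
    # buffered by the writer is not counted until close, so this is only a
    # lower bound on the final compressed size.
    if os.path.getsize(path) >= 250000000:  # rough 250 MB cut-off
        break
writer.close()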
Upvotes: 3
Views: 1453
Reputation: 24776
I don't know of any way to limit the output file size directly, but it's possible to control it indirectly by limiting the number of uncompressed bytes fed to the ParquetWriter.
import pyarrow.parquet as pq

bytes_written = 0
index = 0
writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
for i in range(300000):
    writer.write_table(table)
    bytes_written = bytes_written + table.nbytes  # uncompressed size of the data just written
    if bytes_written >= 500000000:  # ~500 MB of uncompressed data: start a new file
        writer.close()
        index = index + 1
        writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
        bytes_written = 0
writer.close()
In my case, each 500 MB of uncompressed table data gives a parquet file of around 45 MB. The files will be named output_000.parquet, output_001.parquet, etc.
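If the compression ratio of your data is not known in advance, you can check the on-disk size of each finished file after it is closed and tune the uncompressed-byte threshold accordingly; a minimal sketch, assuming the output_NNN.parquet files produced by the loop above:

import os

# Report the compressed size actually written for each rolled file, so the
# 500 MB uncompressed threshold can be adjusted toward the 200-300 MB target.
index = 0
while os.path.exists(f'output_{index:03}.parquet'):
    path = f'output_{index:03}.parquet'
    size_mb = os.path.getsize(path) / 1000000
    print(f'{path}: {size_mb:.1f} MB on disk')
    index += 1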
Upvotes: 1