RubenLaguna

Reputation: 24776

PyArrow ParquetWriter: Is there a way to limit the size of the output file (splitting)?

I'm using the finer-grained reading and writing example from the PyArrow documentation:

import pyarrow.parquet as pq

# `table` is an existing pyarrow.Table
writer = pq.ParquetWriter('example2.parquet', table.schema)

for i in range(300000):
  writer.write_table(table)
  # Is it possible to know how many bytes have been written out to the file after compression?

writer.close()

I want to limit the output file size to 200-300 MB. I tried writer.file_handle.size(), but that raises *** OSError: only valid on readable files. Is there a good way to limit the size of the output file?

Upvotes: 3

Views: 1453

Answers (1)

RubenLaguna

Reputation: 24776

I don't know of a way to limit the output file size directly, but it is possible to control it indirectly by capping the number of uncompressed bytes fed to the ParquetWriter.

import pyarrow.parquet as pq

bytes_written = 0  # uncompressed bytes fed to the current writer
index = 0
writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)

for i in range(300000):
  writer.write_table(table)
  bytes_written += table.nbytes  # uncompressed size of the table just written
  if bytes_written >= 500000000:  # ~500 MB uncompressed, start a new file
    writer.close()
    index += 1
    writer = pq.ParquetWriter(f'output_{index:03}.parquet', table.schema)
    bytes_written = 0

writer.close()

In my case each 500 MB of uncompressed table data yields a Parquet file of around 45 MB. The files will be named output_000.parquet, output_001.parquet, etc.
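
If you want to check the actual compressed size on disk rather than relying on the uncompressed estimate, a minimal sketch (assuming the same output_*.parquet naming as above) is to inspect each file after it has been closed:

import os
import pyarrow.parquet as pq

# Minimal sketch: after closing a file, compare its size on disk with the
# in-memory size of its data, to tune the 500 MB uncompressed threshold above.
path = 'output_000.parquet'
compressed_bytes = os.path.getsize(path)            # actual bytes on disk
uncompressed_bytes = pq.read_table(path).nbytes     # in-memory size after reading back
print(f'{path}: {compressed_bytes} bytes on disk, ~{uncompressed_bytes} bytes uncompressed')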

Upvotes: 1
