Reputation: 43
So my question is whether using PyArrow's pq.read_table('dataset.parq').to_pandas() will hold twice the data being read in memory: once for the PyArrow table produced by pq.read_table('dataset.parq'), and a second time in DataFrame form after calling .to_pandas().
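To make the concern concrete, here is the same chained call split into its two steps (just a sketch of my understanding, nothing beyond the line above):

    import pyarrow.parquet as pq

    # Step 1: the Parquet file is read into an Arrow Table held in memory
    table = pq.read_table('dataset.parq')
    # Step 2: the Table is converted to a pandas DataFrame; if this step
    # copies, both `table` and `df` own the data until `table` is released
    df = table.to_pandas()
    del table  # explicitly drop the Arrow Table once the DataFrame exists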
I have a generator which yields pa.RecordBatch.from_arrays() objects, and I would like to reduce in-memory usage as much as possible. My current approach has been to collect the batches into a list and then use pa.Table.from_batches to create a large_table, and finally call large_table.to_pandas(). However, this leaves two objects in memory: 1) large_table (memory views over the list of batches) and 2) the pandas DataFrame.
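Roughly, the current approach looks like this (make_batches() below is only a stand-in for my real generator):

    import pyarrow as pa

    def make_batches():
        # stand-in for the actual generator yielding pa.RecordBatch objects
        for i in range(3):
            yield pa.RecordBatch.from_arrays(
                [pa.array([i, i + 1, i + 2])], names=['col'])

    batches = list(make_batches())                # all batches kept in memory
    large_table = pa.Table.from_batches(batches)  # Table over those batches
    df = large_table.to_pandas()                  # plus the pandas DataFrame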
My only other thought is that I could achieve this by writing to a Parquet file and then reading it back into a pandas DataFrame, but that brings me back to my doubt about pq.read_table('dataset.parq').to_pandas().
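That alternative would look something like this (with 'dataset.parq' as a placeholder path), which is why the behaviour of the read side matters to me:

    import pyarrow.parquet as pq

    pq.write_table(large_table, 'dataset.parq')     # spill the Arrow Table to disk
    del large_table                                 # free the in-memory Table
    df = pq.read_table('dataset.parq').to_pandas()  # but does this double up again?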
Upvotes: 0
Views: 60
Reputation: 495
You can pass zero_copy_only=True to enforce/test whether a copy is made:

large_table.to_pandas(zero_copy_only=True)

See https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
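A small sketch of how you could use that to test it: per the docs, to_pandas raises an ArrowException when the conversion would require copying, so you can catch that and fall back to a normal conversion:

    import pyarrow as pa

    try:
        # Succeeds only if the DataFrame can reference the Arrow buffers directly
        df = large_table.to_pandas(zero_copy_only=True)
    except pa.ArrowException:
        # A copy is unavoidable for this table (e.g. chunked columns, nulls,
        # or non-primitive types), so do the normal converting copy instead
        df = large_table.to_pandas()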
Upvotes: 1