Rafa Calvo

Reputation: 43

Does pyarrow pq.read_table(my_parquet).to_pandas() place data in memory twice?

So my question is whether using PyArrow's pq.read_table('dataset.parq').to_pandas() will hold the data being read in memory twice:

once for the PyArrow Table produced by pq.read_table('dataset.parq'), and a second time as a DataFrame after calling .to_pandas().

I have a generator which yields pa.RecordBatch.from_arrays() objects, and I would like to reduce in-memory usage as much as possible. My current approach has been to collect the batches, build a large_table with pa.Table.from_batches, and finally call large_table.to_pandas(). However, here I have two objects in memory: 1) large_table (memory views over the record batches) and 2) the pandas DataFrame.

My only thought is that I could achieve this by writing to a Parquet file and then reading it back into a pandas DataFrame. But now I am doubting whether pq.read_table('dataset.parq').to_pandas() itself avoids the double copy.

Upvotes: 0

Views: 60

Answers (1)

assignUser

Reputation: 495

You can add zero_copy_only=True to enforce/test whether a copy is made: large_table.to_pandas(zero_copy_only=True). See https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas

Upvotes: 1

Related Questions