Reputation: 43
So my question is whether using PyArrow's pq.read_table('dataset.parq').to_pandas() will hold twice the data being read in memory: once for the PyArrow table produced by pq.read_table('dataset.parq'), and a second time in DataFrame form after calling .to_pandas().
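To make the concern concrete, here is the same chained call split into its two steps (just a sketch of my understanding, nothing beyond the line above):

    import pyarrow.parquet as pq

    # Step 1: the Parquet file is read into an Arrow Table held in memory
    table = pq.read_table('dataset.parq')
    # Step 2: the Table is converted to a pandas DataFrame; if this step
    # copies, both `table` and `df` own the data until `table` is released
    df = table.to_pandas()
    del table  # explicitly drop the Arrow Table once the DataFrame exists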
I have a generator which yields pa.RecordBatch.from_arrays() objects, and I would like to reduce in-memory usage as much as possible. My current approach has been to collect the batches into a list and then use pa.Table.from_batches to create a large_table, and finally call large_table.to_pandas(). However, this leaves two objects in memory: 1) large_table (memory views over the list of batches) and 2) the pandas DataFrame.
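Roughly, the current approach looks like this (make_batches() below is only a stand-in for my real generator):

    import pyarrow as pa

    def make_batches():
        # stand-in for the actual generator yielding pa.RecordBatch objects
        for i in range(3):
            yield pa.RecordBatch.from_arrays(
                [pa.array([i, i + 1, i + 2])], names=['col'])

    batches = list(make_batches())                # all batches kept in memory
    large_table = pa.Table.from_batches(batches)  # Table over those batches
    df = large_table.to_pandas()                  # plus the pandas DataFrame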
My only other thought is that I could achieve this by writing to a Parquet file and then reading it back into a pandas DataFrame, but that brings me back to my doubt about pq.read_table('dataset.parq').to_pandas().
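That alternative would look something like this (with 'dataset.parq' as a placeholder path), which is why the behaviour of the read side matters to me:

    import pyarrow.parquet as pq

    pq.write_table(large_table, 'dataset.parq')     # spill the Arrow Table to disk
    del large_table                                 # free the in-memory Table
    df = pq.read_table('dataset.parq').to_pandas()  # but does this double up again?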
Upvotes: 0
Views: 60
Reputation: 495
You can pass zero_copy_only=True to enforce/test whether a copy is made:

large_table.to_pandas(zero_copy_only=True)

See https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
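A small sketch of how you could use that to test it: per the docs, to_pandas raises an ArrowException when the conversion would require copying, so you can catch that and fall back to a normal conversion:

    import pyarrow as pa

    try:
        # Succeeds only if the DataFrame can reference the Arrow buffers directly
        df = large_table.to_pandas(zero_copy_only=True)
    except pa.ArrowException:
        # A copy is unavoidable for this table (e.g. chunked columns, nulls,
        # or non-primitive types), so do the normal converting copy instead
        df = large_table.to_pandas()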
Upvotes: 1