Reputation: 2913
How do I iterate through a reader, assuming it's a pyarrow._flight.FlightStreamReader object? It can be obtained with
reader = client.do_get(flight_info.endpoints[0].ticket, options)
The entire example.py script came from https://github.com/dremio-hub/arrow-flight-client-examples/blob/main/python/example.py
Currently I call reader.read_pandas() so that it generates a dataframe for the entire Dremio result set. Unfortunately, if the query returns over 50 million rows or so, the data may not fit in a dataframe / there may not be enough memory for it, and my process just gets killed. How do I iterate through the reader object and get chunks, so I can maybe generate a dataframe per chunk?
When I use
for chunk in reader.read_chunk():
    print(chunk.to_pandas())
for the first chunk it converts/extracts only 3968 rows from the results and puts them in a dataframe, but the second "chunk" is a None object. My example really has millions of rows.
In short, how can I iterate through the reader with a specified chunk size? And is it possible to print these chunks row by row without converting them to a dataframe?
Upvotes: 1
Views: 857
Reputation: 21
I wrote the following and it works great:
while True:
    try:
        # read_chunk() returns a FlightStreamChunk, which unpacks
        # into (RecordBatch, app_metadata buffer)
        batch, buf = reader.read_chunk()
        yield batch
    except StopIteration:
        # raised when the stream is exhausted
        break
The consumer function does the batch.to_pandas() conversion.
The missing part is how to configure the chunk size.
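Putting the pieces together, here is a minimal, self-contained sketch of the pattern. Since a live Dremio/Flight server isn't available here, a hypothetical FakeStreamReader stands in for FlightStreamReader (it mimics read_chunk() raising StopIteration at end of stream), and the batches are plain dicts of column lists rather than pyarrow.RecordBatch objects. It also shows printing rows without building a dataframe; with a real RecordBatch you could call batch.to_pylist() to get a list of row dicts instead.

```python
def iter_batches(reader):
    """Yield batches from a Flight-style reader until the stream ends
    (i.e. until read_chunk() raises StopIteration)."""
    while True:
        try:
            batch, _metadata = reader.read_chunk()
            yield batch
        except StopIteration:
            break


class FakeStreamReader:
    """Stand-in for pyarrow._flight.FlightStreamReader: serves a fixed
    list of batches, then raises StopIteration like the real reader."""
    def __init__(self, batches):
        self._batches = iter(batches)

    def read_chunk(self):
        # Mirror the (data, app_metadata) shape of a FlightStreamChunk.
        return next(self._batches), None


def print_rows(batch):
    """Print a columnar batch row by row, without pandas.
    Here `batch` is a dict of column-name -> list of values."""
    columns = list(batch)
    for row in zip(*(batch[c] for c in columns)):
        print(dict(zip(columns, row)))


reader = FakeStreamReader([
    {"id": [1, 2], "name": ["a", "b"]},
    {"id": [3], "name": ["c"]},
])
for batch in iter_batches(reader):
    print_rows(batch)
```

Note that iter_batches() just forwards whatever batch sizes the server sends; the client does not control how the stream is split into chunks.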
Upvotes: 2