Reputation: 1584
Suppose I have a large amount of data that I am loading into a dataframe in chunks. For example: I have a table that is more than 40 GB, selecting 3 columns from it gives roughly 2-3 GB, and it has about 10 million rows.
c = pd.read_sql("select a,b,c from table;", con=db, chunksize=10**2)
b = c['a']
Since it is reading the table chunk by chunk, does that mean it is not loading the whole 3 GB into memory at once, and only operates on 10^2 MB at a time before automatically moving on to the next chunk?
If not, how can I make it behave like this?
Upvotes: 0
Views: 291
Reputation: 1158
Quoting the documentation:
chunksize : int, default None
If specified, return an iterator where chunksize is the number of rows
to include in each chunk.
So first of all, chunksize denotes the number of rows, not a size in MB. Providing a chunksize also has the effect that an iterator is returned instead of a dataframe, so you need to loop over it. Given that, on the Python side you only need memory for 10^2 rows at a time.
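A minimal sketch of such a loop, assuming a setup like the one in the question (the connection string and the processing step are just hypothetical placeholders):
from sqlalchemy import create_engine
import pandas as pd

# Hypothetical connection; in the question `db` is already an open connection.
db = create_engine("postgresql://user:password@host/dbname")

# With chunksize set, read_sql returns an iterator of DataFrames,
# each holding at most 100 rows.
chunks = pd.read_sql("select a,b,c from table;", con=db, chunksize=10**2)

for chunk in chunks:
    b = chunk['a']  # work on the current 100-row chunk only
    # ... process b; the next 100 rows are fetched on the next iteration ...
This way only one chunk is held in memory at a time, instead of the full result set.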
Upvotes: 1