Viv

Reputation: 1584

How does a DataFrame store a large amount of data in memory and manipulate it?

Suppose I have a large amount of data that I am loading into a DataFrame in chunks. For example: I have a table that is more than 40 GB, selecting 3 columns from it is maybe around 2-3 GB, and it has 10 million records (rows).

    import pandas as pd

    c = pd.read_sql("select a, b, c from table;", con=db, chunksize=10**2)
    b = c['a']

Since it is reading the table chunk by chunk, does that mean it is not loading the whole 3 GB into memory at once, and instead operates only on 10^2 MB at a time and then moves on to the next chunk automatically?

If not, how do I make it behave like that?

Upvotes: 0

Views: 291

Answers (1)

nucleon

Reputation: 1158

Quoting the documentation:

chunksize : int, default None
    If specified, return an iterator where chunksize is the number of rows
    to include in each chunk.

So first of all, chunksize denotes the number of rows, not the size in MB. Providing a chunksize also has the effect that an iterator is returned instead of a DataFrame, so you need to loop over it. Given that, on the Python side you only need memory for the 10^2 rows of the current chunk.
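A minimal sketch of that loop, assuming `db` is an open database connection/engine as in the question, and that summing column `a` stands in for whatever per-chunk work you actually need:

    import pandas as pd

    # db is assumed to be an open connection or SQLAlchemy engine
    chunks = pd.read_sql("select a, b, c from table;", con=db, chunksize=10**2)

    partial_sums = []
    for chunk in chunks:                       # each chunk is a DataFrame of up to 100 rows
        partial_sums.append(chunk['a'].sum())  # only this chunk is held in memory
    total = sum(partial_sums)                  # combine the per-chunk results at the end

This way the full 2-3 GB result set never has to fit in memory at once; only one chunk of rows is materialized at a time.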

Upvotes: 1
