Reputation: 23
I am trying to read a 40 GB CSV file with pandas and perform some operations on it. I am reading it in chunks, but I still get a MemoryError. (System RAM = 32 GB)
Code
import pandas as pd

df = pd.DataFrame()
for chunk in pd.read_csv('file.csv', low_memory=False, chunksize=50000):
    df = df.append(chunk)
How should I structure my code so it can read and process this large file?
Upvotes: 2
Views: 1676
Reputation: 2245
"You can't have a DataFrame larger than your machine's RAM."
https://tomaugspurger.github.io/modern-8-scaling.html
If you're reading a 40 GB file into 32 GB of RAM, I don't think that will work. Can you perform your operations on each chunk as it is read and save the results, instead of operating on the entire dataset at once? A sketch of that pattern is below.
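For example, here is a minimal sketch of chunk-wise processing. The column name 'amount', the running sum, and the filter condition are placeholders for whatever operations you actually need; only the per-chunk results are kept in memory, never the full 40 GB.

import pandas as pd

total = 0
filtered_chunks = []

for chunk in pd.read_csv('file.csv', low_memory=False, chunksize=50000):
    # Aggregate per chunk; only the running total stays in memory.
    total += chunk['amount'].sum()  # 'amount' is a placeholder column name

    # Keep only the rows you actually need before accumulating anything.
    filtered_chunks.append(chunk[chunk['amount'] > 100])

# The filtered result should be small enough to fit in RAM.
filtered = pd.concat(filtered_chunks, ignore_index=True)
filtered.to_csv('filtered.csv', index=False)
print(total)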
BTW, if you're building a DataFrame from chunks, rather than appending each chunk to the same DataFrame on every iteration, it's faster to collect them in a list and concat them once at the end. Otherwise, pandas has to create a new, ever-larger DataFrame on every iteration.
dfs = []
for chunk in pd.read_csv('file.csv', low_memory=False, chunksize=50000):
    dfs.append(chunk)
df = pd.concat(dfs)
Upvotes: 3