NoThangButtaChknWang

Reputation: 31

How can I read and manipulate large CSV files in Google Colaboratory without using all the RAM?

I am trying to import and manipulate compressed .csv files (each about 500MB compressed) in Google Colaboratory. There are 7 files in total. Using pandas.read_csv(), I "use all the available RAM" after importing just 2 files and have to restart my runtime.
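Roughly, my import loop looks like this (the paths and the list are just illustrative; I read each file the same way):

import pandas as pd

# Placeholder Drive paths; the real files are ~500MB each, gzipped
paths = ['Google drive path 1', 'Google drive path 2']  # ... 7 files total

dfs = []
for path in paths:
    # Each call decompresses and loads the entire file into RAM at once
    dfs.append(pd.read_csv(path, compression='gzip'))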

I have searched this site extensively and tried every answer I came across, but none of them worked. The files are in my Google Drive, which I have mounted.

How can I read and manipulate all of the files without using all the RAM? I have 12.72GB of RAM and 358.27GB of disk.

Buying more RAM isn't an option.

Upvotes: 2

Views: 1312

Answers (1)

NoThangButtaChknWang

Reputation: 31

To solve my problem, I created 7 cells (one for each data file). Within each cell I read the file, manipulated it, saved what I needed, then deleted everything:

import pandas as pd
import gc

# Read one compressed file from Drive (path is a placeholder)
df = pd.read_csv('Google drive path', compression='gzip')

# Keep only the rows I need
filtered_df = df.query('my query condition here')

# Write the reduced result back to Drive, compressed
filtered_df.to_csv('new Google drive path', compression='gzip')

# Drop references to both frames, then force garbage collection
del df
del filtered_df

gc.collect()

After processing all 7 files (each about 500MB compressed, roughly 7,000,000 rows by 100 columns in total), my RAM usage stayed under 1GB.

Using del alone didn't free enough RAM; I had to call gc.collect() afterwards in each cell.
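If you want to verify that memory is actually released, one way (just a sketch, using psutil, which is preinstalled in Colab) is to print the process's resident memory after collecting:

import gc
import psutil

gc.collect()
# Resident set size of the current process, in MB
print(psutil.Process().memory_info().rss / 1024 ** 2)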

Upvotes: 1
