Richard Herron

Reputation: 10102

Memory-efficient import of many data files into a pandas DataFrame in Python

I import a directory of |-delimited .dat files into a pandas DataFrame. The following code works, but I eventually run out of RAM with a MemoryError:

import pandas as pd
import glob

# collect one DataFrame per file, then concatenate at the end
temp = []
dataDir = 'C:/users/richard/research/data/edgar/masterfiles'
for dataFile in glob.glob(dataDir + '/master_*.dat'):
    print(dataFile)  # show progress as each file is read
    temp.append(pd.read_table(dataFile, delimiter='|', header=0))

masterAll = pd.concat(temp)

Is there a more memory efficient approach? Or should I go whole hog to a database? (I will move to a database eventually, but I am baby stepping my move to pandas.) Thanks!

FWIW, here is the head of an example .dat file:

cik|cname|ftype|date|fileloc
1000032|BINCH JAMES G|4|2011-03-08|edgar/data/1000032/0001181431-11-016512.txt
1000045|NICHOLAS FINANCIAL INC|10-Q|2011-02-11|edgar/data/1000045/0001193125-11-031933.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-11|edgar/data/1000045/0001193125-11-005531.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-27|edgar/data/1000045/0001193125-11-015631.txt
1000045|NICHOLAS FINANCIAL INC|SC 13G/A|2011-02-14|edgar/data/1000045/0000929638-11-00151.txt

Upvotes: 0

Views: 1250

Answers (1)

Bakuriu

Reputation: 101979

Usually, if you care about memory usage, it's better to use a generator instead of building a list up front. Something like:

import os  # needed for os.path.join below

# data_dir would hold the same path as the question's dataDir
dir_path = os.path.join(data_dir, 'master_*.dat')
master_all = pd.concat(pd.read_table(data_file, delimiter='|', header=0)
                       for data_file in glob.glob(dir_path))

Or you can write a generator function for a more verbose version, as sketched below.
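
A minimal sketch of that more verbose version, reusing the question's imports and path; the frame_iter name and the pattern argument are illustrative, not part of the original:

def frame_iter(data_dir, pattern='master_*.dat'):
    # yield one DataFrame per matching file instead of
    # accumulating them in a list that stays referenced afterwards
    for data_file in glob.glob(os.path.join(data_dir, pattern)):
        yield pd.read_table(data_file, delimiter='|', header=0)

master_all = pd.concat(frame_iter('C:/users/richard/research/data/edgar/masterfiles'))

Either way, pd.concat still materializes the whole iterable before concatenating; the gain is that no temp list remains referenced after the call returns.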

Anyway, this won't solve the problem if the RAM is not enough to hold the final result plus some temporary space for at least one complete file (and probably more... it depends on how the garbage collector works).

Upvotes: 3
