Reputation: 372
I am currently experiencing an issue while reading big files with Python 2.7 [GCC 4.9] on Ubuntu 14.04 LTS, 32-bit. I read other posts on the same topic, such as Reading a large file in python, and tried to follow their advice, but I still obtain MemoryErrors.
The file I am attempting to read is not that big (~425MB), so first I tried a naive block of code like:
import sys

data = []
isFirstLine = True
lineNumber = 0

print "Reading input file \"" + sys.argv[1] + "\"..."
with open(sys.argv[1], 'r') as fp :
    for x in fp :
        print "Now reading line #" + str(lineNumber) + "..."
        if isFirstLine :
            # the first line holds the (quoted) column names
            keys = [ y.replace('\"', '') for y in x.rstrip().split(',') ]
            isFirstLine = False
        else :
            data.append( x.rstrip().split(',') )
        lineNumber += 1
The code above crashes around line #3202 (of 3228), with output:
Now reading line #3200...
Now reading line #3201...
Now reading line #3202...
Segmentation fault (core dumped)
I tried invoking gc.collect() after reading every line, but I got the same error (and the code became slower).
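For reference, this is roughly what that attempt looked like (a minimal sketch, not my exact script):

import gc
import sys

data = []
lineNumber = 0

with open(sys.argv[1], 'r') as fp :
    for x in fp :
        data.append( x.rstrip().split(',') )
        lineNumber += 1
        # force a collection after every line; this made the loop
        # noticeably slower and did not prevent the crash
        gc.collect()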
Then, following some indications I found here on StackOverflow, I tried numpy.loadtxt():
data = numpy.loadtxt(sys.argv[1], skiprows=1, delimiter=',')
This time, I got a slightly more verbose error:
Traceback (most recent call last):
  File "plot-memory-efficient.py", line 360, in <module>
    if __name__ == "__main__" : main()
  File "plot-memory-efficient.py", line 40, in main
    data = numpy.loadtxt(sys.argv[1], skiprows=1, delimiter=',')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 856, in loadtxt
    X = np.array(X, dtype)
MemoryError
So, I am under the impression that something is not right. What am I missing? Thanks in advance for your help!
UPDATE
Following hd1's answer below, I tried the csv module, and it worked. However, I think there's something important that I might have overlooked: I was parsing each line, and I was actually storing the values as strings. Using csv like this still causes some errors:
with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)
    # get the header
    keys = reader.next()

    for line in reader:
        print "Now reading line #" + str(lineNumber) + "..."
        data.append( line )
        lineNumber += 1
But storing the values as float solves the issue!
with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)
    # get the header
    keys = reader.next()

    for line in reader:
        print "Now reading line #" + str(lineNumber) + "..."
        floatLine = [float(x) for x in line]
        data.append( floatLine )
        lineNumber += 1
So, another issue might be connected with the data structures: storing every value as a small string apparently costs far more memory than storing it as a float.
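To get a feeling for the difference, here is a rough comparison with sys.getsizeof (my own illustration with made-up values; the exact numbers depend on platform and Python build):

import sys

# one hypothetical 100-column row, once as strings and once as floats
row_as_strings = [ "3.14159" ] * 100
row_as_floats = [ float(x) for x in row_as_strings ]

# sys.getsizeof only counts the objects themselves, not the list overhead,
# but it is enough to show that many short strings cost more than floats
print "strings:", sum(sys.getsizeof(x) for x in row_as_strings), "bytes"
print "floats: ", sum(sys.getsizeof(x) for x in row_as_floats), "bytes"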
Upvotes: 0
Views: 266
Reputation: 34677
numpy's loadtxt method is known to be memory-inefficient; that may account for your first problem. As for the second, why not use the csv module:
import csv
import sys

data = []
isFirstLine = True
lineNumber = 0

print "Reading input file \"" + sys.argv[1] + "\"..."
with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)
    reader.next()   # skip the header row
    for line in reader:
        # line is a list of the comma-delimited fields in the file
        pass
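If every field is numeric, the loop body could, for instance, convert each row to floats as it goes (a sketch along the lines of the question's update, assuming all-numeric columns):

import csv
import sys

data = []

with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)
    reader.next()   # skip the header row
    for line in reader:
        # converting fields to float keeps the stored rows compact in memory
        data.append( [float(x) for x in line] )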
Upvotes: 1