Alberto

Reputation: 372

Issue with large files in Python 2.7

I am currently experiencing an issue while reading big files with Python 2.7 [GCC 4.9] on 32-bit Ubuntu 14.04 LTS. I have read other posts on the same topic, such as Reading a large file in python, and tried to follow their advice, but I still get MemoryErrors.

The file I am attempting to read is not that big (~425MB), so first I tried a naive block of code like:

import sys

data = []
isFirstLine = True
lineNumber = 0

print "Reading input file \"" + sys.argv[1] + "\"..."

with open(sys.argv[1], 'r') as fp :
    for x in fp :
        print "Now reading line #" + str(lineNumber) + "..."
        if isFirstLine :
            # the first line holds the (quoted) column names
            keys = [ y.replace('\"', '') for y in x.rstrip().split(',') ]
            isFirstLine = False
        else :
            data.append( x.rstrip().split(',') )
        lineNumber += 1

The code above crashes around line #3202 (of 3228), with output:

Now reading line #3200...
Now reading line #3201...
Now reading line #3202...
Segmentation fault (core dumped)

I tried invoking gc.collect() after reading every line, but I got the same error (and the code became slower). Then, following some suggestions I found here on Stack Overflow, I tried numpy.loadtxt():

data = numpy.loadtxt(sys.argv[1], skiprows=1, delimiter=',')

This time, I got a slightly more verbose error:

Traceback (most recent call last):
  File "plot-memory-efficient.py", line 360, in <module>
    if __name__ == "__main__" : main()
  File "plot-memory-efficient.py", line 40, in main
    data = numpy.loadtxt(sys.argv[1], skiprows=1, delimiter=',')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 856, in loadtxt
    X = np.array(X, dtype)
MemoryError

So, I am under the impression that something is not right. What am I missing? Thanks in advance for your help!

UPDATE

Following hd1's answer below, I tried the csv module, and it worked. However, I think there's something important that I might have overlooked: I was parsing each line, and I was actually storing the values as strings. Using csv like this still causes some errors:

with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)

    # get the header
    keys = reader.next()

    for line in reader:
        print "Now reading line #" + str(lineNumber) + "..."
        data.append( line )
        lineNumber += 1

But storing the values as floats solves the issue!

with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)

    # get the header
    keys = reader.next()

    for line in reader:
        print "Now reading line #" + str(lineNumber) + "..."
        floatLine = [float(x) for x in line]
        data.append( floatLine )
        lineNumber += 1

So, another issue might be connected with the data structures.
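As a rough sanity check (just a sketch with a made-up row, not my actual data), comparing the approximate footprint of one row stored as strings versus floats shows how much per-object overhead the string version carries:

import sys

# Hypothetical row of 100 numeric fields, as csv.reader would return them (strings)
row_as_strings = ["3.14159265"] * 100
row_as_floats = [float(x) for x in row_as_strings]

def approx_size(seq) :
    # Rough estimate: the list object itself plus each element object.
    # sys.getsizeof does not follow references, so this is only indicative.
    return sys.getsizeof(seq) + sum(sys.getsizeof(item) for item in seq)

print "Row as strings: ~" + str(approx_size(row_as_strings)) + " bytes"
print "Row as floats:  ~" + str(approx_size(row_as_floats)) + " bytes"

On a 32-bit build the process address space tops out around 3 GB, so the per-field overhead of millions of small string objects could exhaust memory well before the raw file size (~425MB) would suggest.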

Upvotes: 0

Views: 266

Answers (1)

hd1

Reputation: 34677

numpy's loadtxt method is known to be memory-inefficient. That may address your first problem. As for the second, why not use the csv module:

import csv
import sys

data = []
lineNumber = 0

print "Reading input file \"" + sys.argv[1] + "\"..."

with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)
    reader.next()    # skip the header row
    for line in reader:
        # line is a list of the comma-delimited fields in the file
        data.append( line )
        lineNumber += 1
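If you ultimately need a numpy array, one option (assuming every column after the header is numeric) is to convert each row to floats as you read it and build the array in a single final step, along these lines:

import csv
import sys

import numpy

with open(sys.argv[1], 'r') as fp :
    reader = csv.reader(fp)
    keys = reader.next()    # header row
    # convert each field to float while reading; much lighter than keeping strings
    rows = [ [float(x) for x in line] for line in reader ]

data = numpy.array(rows)    # one final conversion, if an ndarray is needed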

Upvotes: 1
