LookIntoEast

Reputation: 8808

"for line in file object" method to read files

I'm trying to find the best way to read/process lines of a very large file. Here I just try

for line in f:

Part of my script is as below:

import gzip

o = gzip.open(file2, 'w')
LIST = []
f = gzip.open(file1, 'r')
for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)
        if ave1 < 84:
            del LIST[-4:]
o.writelines(LIST)

My file1 is around 10GB; when I run the script, the memory usage just keeps increasing to about 15GB without producing any output. That means the computer is still trying to read the whole file into memory first, right? This really makes no difference compared with using readlines().

However, in the post Different ways to read large data in python, Srikar told me: "The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files."

But obviously I still need to worry about large files. I'm really confused. Thanks.

Edit: Every 4 lines form a kind of group in my data. The purpose is to do some calculation on every 4th line and, based on that calculation, decide whether to keep those 4 lines. So writing lines is my purpose.

Upvotes: 1

Views: 3979

Answers (5)

eyquem

Reputation: 27585

If you do not use the with statement, you must close the file handles yourself:

o.close()

f.close()
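Alternatively, a with statement closes both files automatically, even if an exception is raised. A minimal sketch, reusing the file1/file2 names from the question (the per-line write is just a placeholder for the real processing):

import gzip

# Both handles are closed automatically when the block exits,
# even if an exception is raised inside it.
with gzip.open(file1, 'r') as f, gzip.open(file2, 'w') as o:
    for line in f:
        o.write(line)  # placeholder for the real per-line processing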

Upvotes: 0

pseudoramble

Reputation: 2561

It looks like you're holding all of the lines you've read in memory and only writing them to a file at the very end. Maybe you can try this process:

  1. Read the lines you need into memory (the first 3 lines).
  2. On the 4th line, append the line & perform your calculation.
  3. If your calculation is what you're looking for, flush the values in your collection to the file.
  4. Regardless of the outcome, create a new collection instance.

I haven't tried this out, but it could maybe look something like this:

import gzip

o = gzip.open(file2, 'w')
f = gzip.open(file1, 'r')
LIST = []

for i, line in enumerate(f):
    if i % 4 != 3:
        LIST.append(line)
    else:
        LIST.append(line)
        b1 = [ord(x) for x in line]
        ave1 = (sum(b1) - 10) / float(len(line) - 1)

        # If we've found what we want, save them to the file
        if ave1 >= 84:
            o.writelines(LIST)

        # Release the values in the list by starting a clean list to work with
        LIST = []

EDIT: As a thought though, since your file is so large, this may not be the best technique because of all the lines you would have to write to file, but it may be worth investigating regardless.

Upvotes: 1

Derek Litz

Reputation: 10897

Ok, you know what your problem is already from the other comments/answers, but let me simply state it.

You are only reading a single line at a time into memory, but you are storing a significant portion of these in memory by appending to a list.

To avoid this, if your algorithm is complicated enough to need earlier data for later look-ups, you would have to store it in the filesystem or a database (on disk) rather than in memory.

From what I see, it seems you can easily write the output incrementally. I.e. you are currently using a list to store both valid lines to write to output and temporary lines you may delete at some point. To be efficient with memory, you want to write the lines from your temporary list as soon as you know they are valid output.

In summary, use your list to store only the temporary data you need for your calculations, and as soon as you have some valid data ready for output, write it to disk and delete it from main memory (in Python this means you should no longer hold any references to it).
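To illustrate the idea (only a sketch, not the asker's exact code; keep_group is a helper name introduced here, and the grouping uses itertools.izip since the question's code is Python 2): keep at most one 4-line group in memory and write it out the moment you know it is valid.

import gzip
import itertools

def keep_group(group):
    # Same test as in the question: average byte value of the 4th line,
    # ignoring the trailing newline.
    last = group[-1]
    ave = (sum(ord(c) for c in last) - 10) / float(len(last) - 1)
    return ave >= 84

f = gzip.open(file1, 'r')
o = gzip.open(file2, 'w')

# Zip four references to the same iterator to get consecutive groups of
# 4 lines; only one group is ever held in memory (use zip in Python 3).
# A trailing partial group, if any, is silently dropped.
for group in itertools.izip(*[iter(f)] * 4):
    if keep_group(group):
        o.writelines(group)

f.close()
o.close()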

Upvotes: 0

sth

Reputation: 229663

Since you add all the lines to the list LIST and only sometimes remove some of them, LIST will become longer and longer. All those lines that you store in LIST will take up memory. Don't keep all the lines around in a list if you don't want them to take up memory.

Also your script doesn't seem to produce any output anywhere, so the point of it all isn't very clear.

Upvotes: 0

Srikar Appalaraju

Reputation: 73638

The reason the memory keeps increasing even though you use enumerate is that you are using LIST.append(line). That basically accumulates all the lines of the file in a list. Obviously it's all sitting in memory. You need to find a way to not accumulate lines like this. Read, process & move on to the next.

One more thing you could do is read your file in chunks (in fact, reading 1 line at a time qualifies under this criterion, with 1 chunk == 1 line), i.e. read a small part of the file, process it, then read the next chunk, and so on. I still maintain that the following is the best way to read files in Python, large or small:

with open(...) as f:
    for line in f:
        <do something with line>

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered IO and memory management so you don't have to worry about large files.
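For the chunked-read variant mentioned above, a minimal sketch (the filename and the 64 KB chunk size are placeholders chosen for illustration, not taken from the question):

with open('big_file.dat', 'rb') as f:
    while True:
        chunk = f.read(64 * 1024)   # read at most 64 KB at a time
        if not chunk:               # an empty result means end of file
            break
        # process(chunk) here -- only one chunk is ever held in memory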

Upvotes: 4
