akhil pathirippilly

Reputation: 1060

Python: deque from the collections library is slow and memory-heavy when dealing with large volumes

I have written a file-to-file validation for my ETL project using core Python APIs. It has methods for a duplicate check, count check, file size check, line-by-line comparison, and logging the conflicts into another output file. I am using objects from the collections library, Counter and deque, instead of a normal list in these methods. It works fine, but for files of 40 million records and above, the entire validation takes 6 to 7 minutes. When I profiled the methods and the main operation, I found that the lines below, which load the contents of each file into a deque, take 3 to 4 minutes.

with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(source.read().splitlines())
    tgt = deque(target.read().splitlines())

So I need to do some tuning here. I would like help on the points below:

  1. What is an efficient way to load the contents of a large file into a collection object?
  2. How can I reduce memory utilization while handling large collection objects?
  3. Does deque.clear() release the memory as well?
  4. Suppose I create a collection object A and store some data, then clear its contents, then create another collection object B and store more data. If I keep clearing collection objects after use like this, will it help the performance of the program?

Hoping for some helping hands here.

Upvotes: 0

Views: 376

Answers (1)

Raymond Hettinger

Reputation: 226754

You can skip the read() step and the splitlines() step, both of which consume memory. The file objects are directly iterable:

with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(source)
    tgt = deque(target)
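One subtle difference worth knowing: iterating a file object keeps the trailing newline on each line, whereas splitlines() strips it. As long as both files are read the same way, a line-by-line comparison is unaffected. A small sketch of the difference, using io.StringIO to stand in for an open file:

```python
import io
from collections import deque

text = "alpha\nbeta\ngamma\n"

# splitlines() strips the line endings...
split_version = deque(text.splitlines())

# ...while iterating a file-like object keeps them.
iter_version = deque(io.StringIO(text))

print(split_version)  # deque(['alpha', 'beta', 'gamma'])
print(iter_version)   # deque(['alpha\n', 'beta\n', 'gamma\n'])
```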

In general, the space consumed by the deque itself is small, only a fraction of the space consumed by all the strings it refers to (on a 64-bit build, a pointer in a deque takes 8 bytes, while even a small string takes at least 50 bytes).
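That ratio can be sketched with sys.getsizeof. The exact figures vary by build and line length, but the strings dominate the container:

```python
import sys
from collections import deque

# Simulated file contents: 100,000 short lines.
lines = [f"record-{i}\n" for i in range(100_000)]
d = deque(lines)

# The deque itself holds only pointers to the strings...
deque_bytes = sys.getsizeof(d)

# ...while the strings it refers to account for most of the memory.
string_bytes = sum(sys.getsizeof(s) for s in lines)

print(f"deque container: ~{deque_bytes:,} bytes")
print(f"strings held:    ~{string_bytes:,} bytes")
```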

So if memory is still tight, consider interning the strings to eliminate excess space caused by duplicate strings:

from sys import intern
with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(map(intern, source))
    tgt = deque(map(intern, target))
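A small sketch (with made-up duplicate lines, built at runtime so the compiler doesn't pre-merge identical literals) showing that interning collapses equal strings into one shared object:

```python
from sys import intern
from collections import deque

# Simulate file lines containing duplicates.
raw = ["status=" + s + "\n" for s in ("OK", "OK", "FAIL")]
print(raw[0] is raw[1])  # False: two separate copies in memory

d = deque(map(intern, raw))
print(d[0] is d[1])      # True: the duplicates now share one object
```

This saves memory only when the files actually contain many repeated lines; for all-unique lines, interning adds a little overhead instead.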

With respect to running time, usually CPU speed is much faster than disk access time, so the program may be I/O bound. In that case, there isn't much you can do to improve speed short of moving to a faster input source.
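One way to check whether the program is I/O bound is to time a bare streaming pass over each file with no processing at all; if that alone approaches the total runtime, the bottleneck is the disk. A rough sketch (time_read is a hypothetical helper, not part of the question's code):

```python
import time

def time_read(path):
    """Wall-clock time for one streaming pass over a file, doing no work."""
    start = time.perf_counter()
    with open(path) as f:
        for _ in f:
            pass
    return time.perf_counter() - start
```

Comparing this baseline against the full validation's runtime shows how much of the 6 to 7 minutes is raw reading versus Python-level processing.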

Upvotes: 1
