Reputation: 1060
I have written a file-to-file validation for my ETL project using core Python APIs. It has methods for a duplicate check, a count check, a file size check, a line-by-line comparison, and logging the conflicts into another output file. In those methods I am using objects from the 'collections' module, Counter and deque, instead of plain lists. It works fine, but for files of 40 million and above the entire validation takes 6 to 7 minutes. When I profiled the methods and the main operation, I found that the lines below, which convert the contents of each file into a deque, take 3 to 4 minutes.
with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(source.read().splitlines())
    tgt = deque(target.read().splitlines())
So this is where I need to do some tuning. I would like to get help on the points below; any helping hands are appreciated.
Upvotes: 0
Views: 376
Reputation: 226754
You can skip the read() step and the splitlines() step, both of which consume memory. The file objects are directly iterable:
with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(source)
    tgt = deque(target)
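One difference to be aware of: iterating a file object keeps the trailing newline on each line, whereas splitlines() strips it. If your comparison logic expects stripped lines, a minimal sketch (assuming the trailing newlines would otherwise throw off your checks) is to strip while building the deques:

import sys
from collections import deque

with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    # rstrip('\n') removes only the line terminator, preserving other whitespace
    src = deque(line.rstrip('\n') for line in source)
    tgt = deque(line.rstrip('\n') for line in target)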
In general, the space consumed by the deques themselves is only a small fraction of the space consumed by all the strings they refer to (on a 64-bit build, a pointer in a deque takes 8 bytes, while even a small string takes at least 50 bytes).
So if memory is still tight, consider interning the strings to eliminate excess space caused by duplicate strings:
from sys import intern

with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(map(intern, source))
    tgt = deque(map(intern, target))
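Interning only pays off when the same line text occurs many times. If the duplicate check is simply counting repeated lines, a sketch (assuming that is roughly what your Counter-based check does) can stream a file straight into a Counter without materializing a deque at all, so only one copy of each distinct line is kept:

import sys
from collections import Counter

with open(sys.argv[1]) as source:
    # Counter consumes the file lazily, one line at a time
    counts = Counter(source)

# any line that appears more than once is a duplicate
duplicates = {line: n for line, n in counts.items() if n > 1}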
With respect to running time, usually CPU speed is much faster than disk access time, so the program may be I/O bound. In that case, there isn't much you can do to improve speed short of moving to a faster input source.
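To confirm whether the load really is I/O bound, it can help to time the file read separately from the rest of the validation. A minimal sketch using time.perf_counter (the deque build is the step you measured as slow):

import sys
import time
from collections import deque

start = time.perf_counter()
with open(sys.argv[1]) as source:
    src = deque(source)
elapsed = time.perf_counter() - start
print(f"loaded {len(src)} lines in {elapsed:.1f} s")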
Upvotes: 1