Reputation: 1060
I have written a file-to-file validation for my ETL project using core Python APIs. It has methods for a duplicate check, a count check, a file size check, a line-by-line comparison, and logging the conflicts into another output file. In those methods I am using objects from the 'collections' module, Counter and deque, instead of plain lists. It works fine, but for files of 40 million and above the entire validation takes 6 to 7 minutes. When I profiled the methods and the main operation, I found that the lines below, which convert the contents of each file into a deque, take 3 to 4 minutes.
with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(source.read().splitlines())
    tgt = deque(target.read().splitlines())
So this is where I need to do some tuning. I would like to get help on the points below; any helping hands are appreciated.
Upvotes: 0
Views: 376
Reputation: 226754
You can skip the read() step and the splitlines() step, both of which consume memory. The file objects are directly iterable:
with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(source)
    tgt = deque(target)
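One difference to be aware of: iterating a file object keeps the trailing newline on each line, whereas splitlines() strips it. If your comparison logic expects stripped lines, a minimal sketch (assuming the trailing newlines would otherwise throw off your checks) is to strip while building the deques:

import sys
from collections import deque

with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    # rstrip('\n') removes only the line terminator, preserving other whitespace
    src = deque(line.rstrip('\n') for line in source)
    tgt = deque(line.rstrip('\n') for line in target)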
In general, the space consumed by the deques themselves is only a small fraction of the space consumed by all the strings they refer to (on a 64-bit build, a pointer in a deque takes 8 bytes, while even a small string takes at least 50 bytes).
So if memory is still tight, consider interning the strings to eliminate excess space caused by duplicate strings:
from sys import intern

with open(sys.argv[1]) as source, open(sys.argv[2]) as target:
    src = deque(map(intern, source))
    tgt = deque(map(intern, target))
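Interning only pays off when the same line text occurs many times. If the duplicate check is simply counting repeated lines, a sketch (assuming that is roughly what your Counter-based check does) can stream a file straight into a Counter without materializing a deque at all, so only one copy of each distinct line is kept:

import sys
from collections import Counter

with open(sys.argv[1]) as source:
    # Counter consumes the file lazily, one line at a time
    counts = Counter(source)

# any line that appears more than once is a duplicate
duplicates = {line: n for line, n in counts.items() if n > 1}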
With respect to running time, usually CPU speed is much faster than disk access time, so the program may be I/O bound. In that case, there isn't much you can do to improve speed short of moving to a faster input source.
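To confirm whether the load really is I/O bound, it can help to time the file read separately from the rest of the validation. A minimal sketch using time.perf_counter (the deque build is the step you measured as slow):

import sys
import time
from collections import deque

start = time.perf_counter()
with open(sys.argv[1]) as source:
    src = deque(source)
elapsed = time.perf_counter() - start
print(f"loaded {len(src)} lines in {elapsed:.1f} s")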
Upvotes: 1