Reputation: 10574
I'm working on optimizing a Python script that needs to parse a huge (12 TB) amount of data. At the moment, it basically looks like:
gzip -d -c big_file.gz | sed -n '/regex|of|interesting|things/p' | script.py
(Actually, the piping is being done by subprocess.Popen, but I don't think that's important -- correct me if I'm wrong.)
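For reference, the Popen wiring looks roughly like this (a minimal sketch; the file name and regex are placeholders, and process_line stands in for whatever the script does with each line):

import subprocess

# gzip decompresses to a pipe...
gzip_proc = subprocess.Popen(
    ["gzip", "-d", "-c", "big_file.gz"],
    stdout=subprocess.PIPE,
)
# ...sed filters that pipe and writes to another pipe...
sed_proc = subprocess.Popen(
    ["sed", "-n", "/regex|of|interesting|things/p"],
    stdin=gzip_proc.stdout,
    stdout=subprocess.PIPE,
)
gzip_proc.stdout.close()  # so gzip sees SIGPIPE if sed exits early
# ...and the Python side reads the filtered lines.
for line in sed_proc.stdout:
    process_line(line)  # placeholder for the real per-line work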
It appears that the gzip -> sed -> python pipes are currently the most time-consuming part of the script. I assume this is because there are three separate processes in play: since none of them share an address space, any data that needs to be passed between them has to be copied from one to the other, so the pipeline results in a total of up to 36 TB being pushed through my RAM rather than just 12.
Am I understanding correctly what's going on?
Upvotes: 3
Views: 155
Reputation: 328566
The pipes probably aren't your problem. Modern PCs can copy memory at rates of 70 GB/s.
If you want to know how much time the first stage takes, run:
time gunzip -c big_file.gz | sed -n '/regex|of|interesting|things/p' > /dev/null
That will unpack the data and filter it and then tell you how long that took.
My guess is that the poor Python script gets too much data and processing huge amounts of data with Python simply takes time.
[EDIT] I just noticed something: The Python docs say:
bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered [...] The default value for bufsize is 0 (unbuffered).
Try with bufsize=-1 or bufsize=102400 when you create the pipes.
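For example, something like this (a sketch reusing the pipeline from the question; the exact commands are assumptions):

import subprocess

# bufsize=-1 selects the system default buffer size (fully buffered);
# a large explicit value such as 102400 also works.
gzip_proc = subprocess.Popen(
    ["gzip", "-d", "-c", "big_file.gz"],
    stdout=subprocess.PIPE,
    bufsize=-1,
)
sed_proc = subprocess.Popen(
    ["sed", "-n", "/regex|of|interesting|things/p"],
    stdin=gzip_proc.stdout,
    stdout=subprocess.PIPE,
    bufsize=-1,
)
gzip_proc.stdout.close()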
Lesson to take from this: Buffered pipes are fast, unbuffered pipes are slow.
Upvotes: 2
Reputation: 27822
Let's first run a little test:
time dd if=/dev/zero of=/dev/null bs=2M count=5000
time dd if=/dev/zero bs=2M count=5000 > /dev/null
time dd if=/dev/zero bs=2M count=5000 | cat > /dev/null
time dd if=/dev/zero bs=2M count=5000 | cat | cat | cat | cat | cat > /dev/null
Results:
5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 0.651287 s, 16.1 GB/s
real 0m0.653s
user 0m0.000s
sys 0m0.650s
5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 0.585887 s, 17.9 GB/s
real 0m0.587s
user 0m0.007s
sys 0m0.580s
5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 8.55412 s, 1.2 GB/s
real 0m8.556s
user 0m0.077s
sys 0m9.267s
5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 9.69067 s, 1.1 GB/s
real 0m9.692s
user 0m0.397s
sys 0m25.617s
Adding a single pipe decreases performance hugely (from roughly 18 GB/s to 1.2 GB/s); adding several more pipes only decreases it slightly further (to 1.1 GB/s). Results seem consistent across multiple runs.
I need to investigate the why more when I have time; my guess is that the cat process reads data with a small buffer, so the dd process ends up writing more slowly.
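If you want to test that read-size hypothesis from Python, here's a rough sketch (assumptions: Linux with dd and /dev/zero available, Python 3): it drains about 1 GB coming out of dd through a pipe, once with small reads and once with large ones.

import subprocess
import time

def drain(read_size):
    # Feed ~1 GB of zeros through a pipe and read it back in chunks of read_size.
    # bufsize=0 keeps the pipe unbuffered on our side, so each read() is one
    # read(2) syscall of at most read_size bytes.
    proc = subprocess.Popen(
        ["dd", "if=/dev/zero", "bs=2M", "count=500"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # discard dd's statistics
        bufsize=0,
    )
    start = time.time()
    while proc.stdout.read(read_size):
        pass
    proc.wait()
    return time.time() - start

print("4 KiB reads: %.2f s" % drain(4096))
print("2 MiB reads: %.2f s" % drain(2 * 1024 * 1024))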
There's a program called bfr which aims to solve this; I have never tried it. The last update is from 2004, though...
You could also try to implement the gzip decompression and the string filtering in Python itself, as sketched below. It's difficult to predict whether the performance gain will be worth the time, though...
Upvotes: 3