Reputation: 589
I'm using sed in a script to remove text from a file like this:
gzip -cd /data/file.gz | sed 's/WITH (appendonly=true, compresstype=quicklz)//' | gzip > file_seeded.gz
It takes a lot of time to perform the operation on big files (50 GB, for example). Is the way I'm doing it optimal, or are there alternatives to speed up the process?
Upvotes: 2
Views: 1311
Reputation: 112442
There is no way to avoid recompressing the edited data, which dominates the execution time. All I can suggest is to use gzip -1 or gzip -3 to speed up the compression at the cost of slightly larger output. You can also use pigz to make use of all of your cores.
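As a sketch of how those suggestions might combine (assuming pigz is installed; it accepts the same -1 through -9 level flags as gzip):
# Decompression is comparatively cheap; the gain comes from
# recompressing on all cores at a faster, lighter level
gzip -cd /data/file.gz | sed 's/WITH (appendonly=true, compresstype=quicklz)//' | pigz -1 > file_seeded.gz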
Upvotes: 2
Reputation: 33725
Use the fact that you can concatenate multiple gzip streams into one valid gzip file:
mysed() {
  # Edit one block of the stream, then compress it independently
  sed 's/WITH (appendonly=true, compresstype=quicklz)//' | gzip
}
export -f mysed
# --pipe splits stdin into blocks and feeds one to each job;
# -k keeps the output in input order, so the concatenated
# gzip streams reassemble into the right file
gzip -cd /data/file.gz | parallel --pipe -k --block 50M mysed > file_seeded.gz
Adjust 50M until you find the value that works best. It depends on how fast I/O to /tmp is and how much RAM and CPU cache you have; the best value will most likely be between 1M and 1000M. If time is more important than disk space, use gzip -1.
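As a hypothetical way to experiment, you could time a few block sizes on just the first few gigabytes of decompressed data before committing to the full run (head -c with a size suffix is GNU coreutils syntax; the sizes here are arbitrary picks):
for block in 10M 50M 100M 500M; do
  echo "block=$block"
  # > /dev/null discards the output so the timing reflects compute, not disk writes
  time (gzip -cd /data/file.gz | head -c 5G | parallel --pipe -k --block $block mysed > /dev/null)
done
Once a good block size is found, swapping gzip for gzip -1 inside mysed trades a little disk space for further speed.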
Upvotes: 2