Matias

Reputation: 589

Speed up sed on a gz file

I'm using a script to sed a file and remove text this way:

gzip -cd /data/file.gz | sed 's/WITH (appendonly=true, compresstype=quicklz)//' | gzip > file_seeded.gz

It takes a lot of time to perform the operation on big files (50 GB, for example). Is the way I'm doing this optimal, or are there alternatives to speed up the process?

Upvotes: 2

Views: 1311

Answers (2)

Mark Adler

Reputation: 112442

There is no way to avoid recompressing the edited data, which dominates the execution time. All I can suggest is to use gzip -1 or gzip -3 to speed up the compression, at the cost of slightly larger output. You can also use pigz to make use of all of your cores.
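As a sketch of that suggestion, using the filenames from the question (pigz is a drop-in parallel replacement for gzip, if it is installed):

```shell
# Same pipeline, but compress at level 1 for speed (slightly larger output).
# Swap "gzip -1" for "pigz -1" to use all cores, if pigz is available.
gzip -cd /data/file.gz \
  | sed 's/WITH (appendonly=true, compresstype=quicklz)//' \
  | gzip -1 > file_seeded.gz
```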

Upvotes: 2

Ole Tange

Reputation: 33725

Use the fact that you can concatenate multiple gzip files:

mysed() {
  sed 's/WITH (appendonly=true, compresstype=quicklz)//' | gzip
}
export -f mysed
gzip -cd /data/file.gz | parallel --pipe -k --block 50M mysed > file_seeded.gz

Adjust 50M until you find the value that works best. It depends on how fast I/O to /tmp is and how much RAM and CPU cache you have. The best value will most likely be between 1M and 1000M.

If time is more important than disk space use gzip -1.
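The trick above works because a gzip file may contain multiple concatenated members: decompressing the concatenation yields the concatenation of the decompressed chunks, so each 50M block can be compressed independently and in parallel. A quick demonstration of that property:

```shell
# Two independently compressed gzip members, appended into one file,
# decompress as a single continuous stream.
printf 'hello\n' | gzip >  combined.gz
printf 'world\n' | gzip >> combined.gz
gzip -cd combined.gz    # prints "hello" then "world"
```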

Upvotes: 2
