Reputation: 10431
In Linux bash, I would like to decompress a large tar.gz (100G-1T, containing hundreds of similarly sized files) so that, as each file finishes decompressing, I can pass it through a bash loop for further processing. See the example below, where --desired_flag stands in for whatever would make this work:
tar xzf --desired_flag large.tar.gz \
| xargs -n1 -P8 -I % do_something_to_decompressed_file %
EDIT: the immediate use case I have in mind is a network operation: as soon as the contents of each decompressed file are available, they can be uploaded somewhere in the next step. Given that the tar step could be either CPU-bound or IO-bound depending on the Linux instance, I would like to pass the files efficiently to the next step, which I presume will be bound by network speed.
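For concreteness, a hypothetical sketch of that per-file step, saved as an executable do_something_to_decompressed_file on the PATH (xargs runs external commands, so a plain shell function alone would not be visible to it); the URL here is a placeholder, not part of the question:

#!/usr/bin/env bash
# Upload one freshly extracted file; $1 is the path tar just finished writing.
# The destination URL is a placeholder; substitute the real endpoint.
curl -sf -T "$1" "https://example.com/upload/$(basename "$1")"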
Upvotes: 0
Views: 408
Reputation: 295288
Given the following function definition:
buffer_lines() {
  local last_name file_name
  # Hold back the first name until we know tar has moved on to the next file.
  read -r last_name || return
  while read -r file_name; do
    # A new name has arrived, so the previously named file is fully extracted.
    printf '%s\n' "$last_name"
    last_name=$file_name
  done
  # End of input: tar has exited, so the last file is complete as well.
  printf '%s\n' "$last_name"
}
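A quick way to observe the one-line delay it introduces, using made-up names: file1 is printed only once file2 has been read, and file3 only at end-of-input:

{ echo file1; sleep 1; echo file2; sleep 1; echo file3; } | buffer_lines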
With that function in place, one can then run the following, whether one's tar implementation prints each name at the beginning or at the end of that file's extraction:
tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 do_something_to_file
Note the v flag, telling tar to print filenames on stdout (in the GNU implementation, in this particular usage mode). Also note the lack of the -I argument: with -d $'\n' -n 1, xargs already passes each line as a single, final argument to do_something_to_file, so no replacement string is needed.
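To see what those xargs flags buy you, a small demonstration with hypothetical names containing spaces (GNU xargs): each line arrives as exactly one argument, appended after the command.

printf '%s\n' 'dir/a file.txt' 'dir/b file.txt' | xargs -d $'\n' -n 1 echo got:
got: dir/a file.txt
got: dir/b file.txt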
If you want to insert a buffer (to allow tar to run ahead of the xargs process), consider pv:
tar xvzf large.tar.gz \
| pv -B 1M \
| buffer_lines \
| xargs -d $'\n' -n 1 -P8 do_something_to_file
...will buffer up to 1MB of extracted filenames should the processing components fall behind.
Upvotes: 2