719016

Reputation: 10431

How to pass each file to a bash loop as soon as tar xzf has finished decompressing it?

In Linux bash, I would like to decompress a large tar.gz (100G-1T, containing hundreds of similarly sized files) so that, as soon as each file has been fully decompressed, I can pass it to a bash loop for further processing. See the example below with a hypothetical --desired_flag:

tar --desired_flag -xzf large.tar.gz \
  | xargs -n1 -P8 -I % do_something_to_decompressed_file %

EDIT: the immediate use case I have in mind is a network operation: as soon as a decompressed file is available, it can be uploaded somewhere in the next step. Since the tar step could be either CPU-bound or IO-bound depending on the Linux instance, I would like to hand the files to the next step efficiently; I presume that step will be bound by network speed.

Upvotes: 0

Views: 408

Answers (1)

Charles Duffy

Reputation: 295288

Given the following function definition:

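buffer_lines() {
  local last_name file_name
  # Hold the first name back; emit nothing if the input is empty.
  IFS= read -r last_name || return
  while IFS= read -r file_name; do
    # A new name arrived, so tar has moved on: release the previous name.
    printf '%s\n' "$last_name"
    last_name=$file_name
  done
  # End of input: tar is done, so the final name can be released too.
  printf '%s\n' "$last_name"
}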

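...one can then run the following. Because buffer_lines holds each name back until the next one arrives (or until the input ends), a name is only released once tar has moved on to the following file, so this works whether one's tar implementation prints names at the beginning or at the end of processing each file: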

tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 do_something_to_file

Note the v flag, which tells tar to print filenames on stdout (in the GNU implementation, in this particular usage mode). Also note the lack of the -I argument: with -n 1, xargs appends each name as the final argument to do_something_to_file.
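
To see the one-name delay in isolation, here is a minimal sketch. It assumes bash; fake_tar is a hypothetical stand-in for tar xv that prints a name and then "works" for a second, and buffer_lines is repeated (with IFS= so surrounding whitespace in names survives) only to keep the snippet self-contained:

```shell
#!/usr/bin/env bash

# Copy of buffer_lines so this sketch runs on its own.
buffer_lines() {
  local last_name file_name
  IFS= read -r last_name || return
  while IFS= read -r file_name; do
    printf '%s\n' "$last_name"
    last_name=$file_name
  done
  printf '%s\n' "$last_name"
}

# Hypothetical stand-in for "tar xv": prints a name when it starts on a
# file, then spends one second "extracting" it.
fake_tar() {
  local name
  for name in a.dat b.dat c.dat; do
    printf '%s\n' "$name"
    sleep 1
  done
}

# Each name is released only once the next name shows up (or input ends),
# i.e. after the simulated extraction of that file has finished.
fake_tar | buffer_lines | while IFS= read -r name; do
  printf '%s released after about %ss\n' "$name" "$SECONDS"
done
```

Without buffer_lines in the middle, a.dat would be reported immediately, while fake_tar is still "extracting" it; with it, each name should show up roughly one second after it was printed.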


If you want to insert a buffer (to allow tar to run ahead of the xargs process), consider pv:

tar xvzf large.tar.gz \
  | pv -B 1M \
  | buffer_lines \
  | xargs -d $'\n' -n 1 -P8 do_something_to_file

...which will buffer up to 1MB of file names should the processing stage fall behind.
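
In the pipelines above, do_something_to_file has to be something xargs can execute. If it is a shell function instead of a program on PATH, one option is to export it and call it through bash -c. The following is only a sketch under stated assumptions: GNU tar and GNU xargs (for -d), a hypothetical process_file worker whose printf stands in for the real upload, and a small throwaway demo archive built in a temporary directory:

```shell
#!/usr/bin/env bash

# Copy of buffer_lines so this sketch runs on its own.
buffer_lines() {
  local last_name file_name
  IFS= read -r last_name || return
  while IFS= read -r file_name; do
    printf '%s\n' "$last_name"
    last_name=$file_name
  done
  printf '%s\n' "$last_name"
}

# Hypothetical worker; replace the printf with the real upload command.
process_file() {
  local path=$1
  # tar -v lists directories as well; only act on regular files.
  [[ -f $path ]] || return 0
  printf 'processed %s\n' "$path"
}
export -f process_file   # make the function visible to the bash run by xargs

# Build a tiny demo archive (assumption: stands in for large.tar.gz).
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/out"
printf 'one\n' > "$workdir/src/a.txt"
printf 'two\n' > "$workdir/src/b.txt"
tar czf "$workdir/demo.tar.gz" -C "$workdir/src" .

# Extract in a subshell so the caller's working directory is untouched.
result=$(
  cd "$workdir/out" &&
    tar xvzf "$workdir/demo.tar.gz" \
      | buffer_lines \
      | xargs -d $'\n' -n 1 -P8 bash -c 'process_file "$1"' _
)
printf '%s\n' "$result"

rm -rf "$workdir"
```

The `_` after the bash -c script fills $0, so the name appended by xargs arrives as $1. With -P8 the workers run in parallel, so the order of the output lines is not guaranteed.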

Upvotes: 2
