AShah

Reputation: 942

Maintaining variables in function - Global variables

I'm trying to run a script in a function and then call it:

filedetails ()
{
    # read TOTAL_DU < "/tmp/sizes.out"
    disksize=$(du -s "$1" | awk '{print $1}')
    let TOTAL_DU+=$disksize
    echo "$TOTAL_DU"
    # echo "$TOTAL_DU" > "/tmp/sizes.out"
}

I'm using the variable TOTAL_DU as a counter to keep a running total of the du of all the files.

I'm running it using parallel or xargs:

find . -type f | parallel -j 8 filedetails

But the variable TOTAL_DU resets every time, so the count is not maintained, which is as expected since a new shell is used for each invocation. I've also tried using a file to export and then read the counter, but because of parallel some jobs complete faster than others, the updates are not sequential (also as expected), so this is no good either. The question is: is there a way to keep the count whilst using parallel or xargs?

Upvotes: 0

Views: 268

Answers (1)

rici

Reputation: 241951

Aside from learning purposes, this is not likely to be a good use of parallel, because:

  1. Calling du like that will quite possibly be slower than just invoking du in the normal way. First, information about file sizes can be extracted from the directory, and so the total for an entire directory can be computed in a single access. Effectively, directories are stored as a special kind of file object, whose data is a vector of directory entries ("dirents"), which contain the name and metadata for each file. What you are doing is using find to print these dirents, then getting du to look each one up again (every file, not every directory); almost all of this second scan is redundant work.

  2. Insisting that du examine every file prevents it from avoiding double-counting multiple hard links to the same file, so you can easily end up inflating the disk usage this way (see the sketch after this list). On the other hand, directories also take up disk space, and normally du will include this space in its reports. But you're never calling it on any directory, so you will end up understating the total disk usage.

  3. You're invoking a shell and an instance of du for every file. Normally, you would only create a single process for a single du. Process creation is a lot slower than reading a file size from a directory. At a minimum, you should use parallel -X and rewrite your shell function to invoke du on all the arguments, rather than just $1, as the code below does.

  4. There is no way to share environment variables between sibling shells. So you would have to accumulate the results in a persistent store, such as a temporary file or database table. That's also an expensive operation, but if you adopted the above suggestion, you would only need to do it once for each invocation of du, rather than for every file.
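
To make the hard-link issue in point 2 concrete, here is a small demonstration (the /tmp/dudemo directory and file names are purely illustrative): a single du notices that both names refer to the same inode, while one-du-per-file invocations count the same blocks twice:

mkdir /tmp/dudemo && cd /tmp/dudemo
dd if=/dev/zero of=file1 bs=1K count=100 2>/dev/null   # one 100K file
ln file1 file2                  # a second name for the same inode

# A single invocation counts the shared blocks once (~100K plus the directory)
du -s .

# One du per file counts them twice (~200K)
find . -type f -exec du -s {} \; | awk '{s+=$1} END {print s}'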

So, ignoring the first two issues, and just looking at the last two, solely for didactic purposes, you could do something like the following:

# Create a temporary file to store results
tmpfile=$(mktemp)
# Function which invokes du and safely appends its summary line
# to the temporary file
collectsizes() {
  # Get the name of the temporary file, and remove it from the args
  tmpfile=$1
  shift
  # Call du on all the parameters, and get the last (grand total) line
  size=$(du -c -s "$@" | tail -n1)
  # lock the temporary file and append the dataline under lock
  flock "$tmpfile" bash -c 'cat "$1" >> "$2"' _ "$size" "$tmpfile"
}
export -f collectsizes

# Find all regular files, and feed them in batches (-X) to parallel,
# taking care to avoid problems if files have whitespace in their names
find . -type f -print0 | parallel -0 -X -j8 collectsizes "$tmpfile"
# When all that's done, sum up the values in the temporary file
awk '{s+=$1}END{print s}' "$tmpfile"
# And delete it.
rm "$tmpfile"

Upvotes: 3
