ravi3

Reputation: 11

Average values across multiple files

I am trying to write a shell script to average several identically formatted files with names file1, file2, file3 and so on.

In each file, the data is in a table, say 4 columns and 5 rows. Let's assume file1, file2 and file3 are in the same directory. What I want to do is create an average file, with the same format as file1/file2/file3, where each element is the average of the corresponding elements in the input files. For example,

{(Element in row 1, column 1 in file1) +
 (Element in row 1, column 1 in file2) +
 (Element in row 1, column 1 in file3)} / 3 >>
(Element in row 1, column 1 in average file)

Likewise for every other element in the table; the average file has the same number of elements as file1, file2 and file3.

I tried to write a shell script, but it doesn't work. What I want is to read the files in a loop, grab the same element from each file, add them up, average them over the number of files and finally write the result in the same file format. This is what I tried to write:

#!/bin/bash       
s=0
for i in {1..5..1} do
    for j in {1..4..1} do
        for f in m* do
            a=$(awk 'FNR == i {print $j}' $f)
            echo $a
            s=$s+$a
            echo $f
        done
        avg=$s/3
        echo $avg > output
    done
done

Upvotes: 1

Views: 2152

Answers (1)

Benjamin W.

Reputation: 52556

This is a rather inefficient way of going about it: for every single number you're trying to extract, you process one of the input files completely – even though you only have three files, you end up making 60 passes over them (5 rows × 4 columns × 3 files)!

Also, mixing Bash and awk in this way is a well-known antipattern; there are good Q&As on Stack Overflow explaining why.

A few more remarks:

  • For brace expansion, the default step size is 1, so {1..4..1} is the same as {1..4}.
  • Awk has no clue what i and j are. As far as it is concerned, those were never defined. If you really wanted to get your shell variables into awk, you could do

    a=$(awk -v i="$i" -v j="$j" 'FNR == i { print $j }' $f)
    

    but the approach is not sound anyway.

  • Shell arithmetic does not work like s=$s+$a or avg=$s/3 – these are just concatenating strings. To have the shell do calculations for you, you'd need arithmetic expansion:

    s=$(( s + a ))
    

    or, a little shorter,

    (( s += a ))
    

    and

    avg=$(( s / 3 ))
    

    Notice that you don't need the $ signs in an arithmetic context.

  • echo $avg > output would overwrite output on every pass through the loop, leaving only the last number; even with >> to append, each number would end up on its own line, which is probably not what you want.
  • Indentation matters! If not for the machine, then for human readers.
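Putting the fixes from the list above together, here is a minimal sketch (with made-up numbers in place of your real files) of passing shell variables into awk with -v and summing with arithmetic expansion:

```shell
#!/bin/sh
# Sketch only: a made-up two-row, three-column input instead of real files
i=2; j=3   # want row 2, column 3

# -v makes the shell variables visible inside the awk program
a=$(printf '10 20 30\n40 50 60\n' | awk -v i="$i" -v j="$j" 'FNR == i { print $j }')

s=0
s=$(( s + a ))   # arithmetic expansion, not string concatenation
echo "$s"        # prints "60"
```

The same $(( ... )) form works for the division step, e.g. avg=$(( s / 3 )).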

A Bash solution

This solves the problem using just Bash. It is hard-coded to three files, but flexible in the number of lines and elements per line. There are no checks to make sure that the number of elements is the same across all lines and files.

Notice that Bash is not fast at this kind of thing and should only be used for small files, if at all. Also, it uses integer arithmetic, so the "average" of 3 and 4 would become 3.

I've added comments to explain what happens.

#!/bin/bash

# Read a line from the first file into array arr1
while read -a arr1; do

    # Read a line from the second file at file descriptor 3 into array arr2
    read -a arr2 <&3

    # Read a line from the third file at file descriptor 4 into array arr3
    read -a arr3 <&4

    # Loop over elements
    for (( i = 0; i < ${#arr1[@]}; ++i )); do

        # Calculate average of element across files, assign to res array
        res[i]=$(( (arr1[i] + arr2[i] + arr3[i]) / 3 ))
    done

    # Print res array
    echo "${res[@]}"

# Read from files supplied as arguments
# Input for the second and third file is redirected to file descriptors 3 and 4
# to enable looping over multiple files concurrently
done < "$1" 3< "$2" 4< "$3"

This has to be called like

./bashsolution file1 file2 file3

and output can be redirected as desired.
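For example, with three hypothetical one-row input files the loop from the script behaves like this (integer averages, as noted above):

```shell
#!/bin/bash
# Demo with throwaway files; the loop body is the same as in the script above
tmp=$(mktemp -d)
printf '1 2\n' > "$tmp/file1"
printf '3 4\n' > "$tmp/file2"
printf '5 6\n' > "$tmp/file3"

while read -a arr1; do
    read -a arr2 <&3
    read -a arr3 <&4
    for (( i = 0; i < ${#arr1[@]}; ++i )); do
        res[i]=$(( (arr1[i] + arr2[i] + arr3[i]) / 3 ))
    done
    echo "${res[@]}"
done < "$tmp/file1" 3< "$tmp/file2" 4< "$tmp/file3"   # prints "3 4"

rm -r "$tmp"
```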

An awk solution

This is a solution in pure awk. It's a bit more flexible in that it takes the average of however many files are supplied as arguments; it should also be faster than the Bash solution by about an order of magnitude.

#!/usr/bin/awk -f

# Count number of files: increment on the first line of each new file
FNR == 1 { ++nfiles }

{
    # (Pseudo) 2D array summing up fields across files
    for (i = 1; i <= NF; ++i) {
        values[FNR, i] += $i
    }
}

END {
    # Loop over lines of array with sums
    for (i = 1; i <= FNR; ++i) {

        # Loop over fields of current line in array of sums
        for (j = 1; j <= NF; ++j) {

            # Build record with averages
            $j = values[i, j]/nfiles
        }
        print
    }
}

It has to be called like

./awksolution file1 file2 file3

and, as mentioned, there is no limit to the number of files to average over.
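As a quick illustration with two throwaway one-number files: awk does floating-point division, so unlike the Bash version, the average of 3 and 4 really comes out as 3.5:

```shell
#!/bin/sh
# Throwaway demo running the awk program above inline on two tiny files
tmp=$(mktemp -d)
printf '3\n' > "$tmp/file1"
printf '4\n' > "$tmp/file2"

awk 'FNR == 1 { ++nfiles }
     { for (i = 1; i <= NF; ++i) values[FNR, i] += $i }
     END {
         for (i = 1; i <= FNR; ++i) {
             for (j = 1; j <= NF; ++j) $j = values[i, j] / nfiles
             print
         }
     }' "$tmp/file1" "$tmp/file2"   # prints "3.5"

rm -r "$tmp"
```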

Upvotes: 1
