Reputation: 11
I am trying to write a shell script to average several identically formatted files named file1, file2, file3 and so on.
In each file, the data is a table, for example 4 columns and 5 rows. Let's assume file1, file2 and file3 are in the same directory. I want to create an average file with the same format as file1/file2/file3, where each element is the average of the corresponding elements from the input tables. For example,
{(Element in row 1, column 1 in file1) +
(Element in row 1, column 1 in file2) +
(Element in row 1, column 1 in file3)} / 3 >>
(Element in row 1, column 1 in average file)
Likewise, I need to do this for each element in the table; the average file has the same number of elements as file1, file2 and file3.
I tried to write a shell script, but it doesn't work. What I want is to read the files in a loop and grep the same element from each file, add them and average them over the number of files and finally write to a similar file format. This is what I tried to write:
#!/bin/bash
s=0
for i in {1..5..1} do
for j in {1..4..1} do
for f in m* do
a=$(awk 'FNR == i {print $j}' $f)
echo $a
s=$s+$a
echo $f
done
avg=$s/3
echo $avg > output
done
done
Upvotes: 1
Views: 2152
Reputation: 52556
This is a rather inefficient way of going about it: for every single number you extract, you process one of the input files completely. With 5 rows, 4 columns and 3 files, that is 60 awk runs, even though you only have three files!
Also, mixing Bash and awk in this way is a massive antipattern. This here is a great Q&A explaining why.
A few more remarks:
{1..4..1} is the same as {1..4}.
Awk has no clue what i and j are. As far as it is concerned, those were never defined. If you really wanted to get your shell variables into awk, you could do
a=$(awk -v i="$i" -v j="$j" 'FNR == i { print $j }' $f)
but the approach is not sound anyway.
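To see the effect of -v in isolation, here is a minimal sketch. The temporary file and its 2x3 contents are made up for illustration and are not part of the question's data.

```shell
#!/bin/bash
# Hypothetical sample data: a 2x3 table in a temporary file
tmp=$(mktemp)
printf '%s\n' '1 2 3' '4 5 6' > "$tmp"

i=2   # row number, a shell variable
j=3   # column number, a shell variable

# Without -v, awk's own i and j are unset, so FNR == i never matches
without=$(awk 'FNR == i { print $j }' "$tmp")

# With -v, awk receives copies of the shell values
with=$(awk -v i="$i" -v j="$j" 'FNR == i { print $j }' "$tmp")

echo "without -v: '$without'"   # without -v: ''
echo "with -v: '$with'"         # with -v: '6'

rm -f "$tmp"
```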
Shell arithmetic does not work like s=$s+$a or avg=$s/3; these just concatenate strings. To have the shell do calculations for you, you'd need arithmetic expansion:
s=$(( s + a ))
or, a little shorter,
(( s += a ))
and
avg=$(( s / 3 ))
Notice that you don't need the $ signs in an arithmetic context.
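As a quick illustration of the difference between string concatenation and arithmetic expansion, here is a small sketch; the values 5 and 7 are arbitrary.

```shell
#!/bin/bash
s=5
a=7

# Plain assignment only builds a string
concat=$s+$a
echo "$concat"    # 5+7

# Arithmetic expansion evaluates the expression
sum=$(( s + a ))
echo "$sum"       # 12

# Compound assignment in an arithmetic context
(( s += a ))
echo "$s"         # 12

# Integer division
avg=$(( s / 3 ))
echo "$avg"       # 4
```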
echo $avg > output overwrites output on every iteration, so only the last value would survive; with >> instead, every number would end up on a separate line, which is probably not what you want.
This solves the problem using just Bash. It is hard coded to three files, but flexible in the number of lines and elements per line. There are no checks to make sure that the number of elements is the same for all lines and files.
Notice that Bash is not fast at that kind of thing and should only be used for small files, if at all. Also, it uses integer arithmetic, so the "average" of 3 and 4 would become 3.
I've added comments to explain what happens.
#!/bin/bash
# Read a line from the first file into array arr1
while read -a arr1; do
    # Read a line from the second file at file descriptor 3 into array arr2
    read -a arr2 <&3
    # Read a line from the third file at file descriptor 4 into array arr3
    read -a arr3 <&4
    # Loop over elements
    for (( i = 0; i < ${#arr1[@]}; ++i )); do
        # Calculate average of element across files, assign to res array
        res[i]=$(( (arr1[i] + arr2[i] + arr3[i]) / 3 ))
    done
    # Print res array
    echo "${res[@]}"
# Read from files supplied as arguments
# Input for the second and third file is redirected to file descriptors 3 and 4
# to enable looping over multiple files concurrently
done < "$1" 3< "$2" 4< "$3"
This has to be called like
./bashsolution file1 file2 file3
and output can be redirected as desired.
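The integer truncation mentioned above can be seen directly in a short sketch. The scaled variant is only one possible workaround, not part of the solution above: it assumes non-negative inputs and a fixed precision of two decimals, and it truncates rather than rounds.

```shell
#!/bin/bash
# Integer division drops the fractional part
int_avg=$(( (3 + 4) / 2 ))
echo "$int_avg"     # 3

# Assumed workaround: scale by 100 before dividing, then split the result
# into whole and fractional parts (non-negative values only)
scaled=$(( (3 + 4) * 100 / 2 ))
fixed_avg=$(printf '%d.%02d' $(( scaled / 100 )) $(( scaled % 100 )))
echo "$fixed_avg"   # 3.50
```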
This is a solution in pure awk. It's a bit more flexible in that it takes the average of however many files are supplied as arguments; it should also be faster than the Bash solution by about an order of magnitude.
#!/usr/bin/awk -f
# Count number of files: increment on the first line of each new file
FNR == 1 { ++nfiles }
{
    # (Pseudo) 2D array summing up fields across files
    for (i = 1; i <= NF; ++i) {
        values[FNR, i] += $i
    }
}
END {
    # Loop over lines of array with sums
    for (i = 1; i <= FNR; ++i) {
        # Loop over fields of current line in array of sums
        for (j = 1; j <= NF; ++j) {
            # Build record with averages
            $j = values[i, j]/nfiles
        }
        print
    }
}
It has to be called like
./awksolution file1 file2 file3
and, as mentioned, there is no limit to the number of files to average over.
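For a quick check of the approach, here is a self-contained sketch that runs a variation of the same awk logic on two made-up 2x2 tables. The sample files and the names sums, nrows and ncols are assumptions for illustration. Unlike the script above, this variation records the table dimensions while reading instead of relying on FNR and NF in the END block.

```shell
#!/bin/bash
# Hypothetical sample data: two 2x2 tables in a temporary directory
dir=$(mktemp -d)
printf '%s\n' '1 2' '3 4' > "$dir/file1"
printf '%s\n' '2 4' '5 6' > "$dir/file2"

result=$(awk '
    FNR == 1 { ++nfiles }                 # first line of each new file
    {
        for (i = 1; i <= NF; ++i)
            sums[FNR, i] += $i            # accumulate per cell
        if (FNR > nrows) nrows = FNR      # remember table dimensions
        if (NF > ncols) ncols = NF
    }
    END {
        for (i = 1; i <= nrows; ++i) {
            line = ""
            for (j = 1; j <= ncols; ++j)
                line = line (j > 1 ? " " : "") sums[i, j] / nfiles
            print line
        }
    }
' "$dir/file1" "$dir/file2")

echo "$result"
# 1.5 3
# 4 5

rm -rf "$dir"
```

Because awk computes in floating point, the first cell comes out as 1.5 rather than being truncated to 1 as it would be with Bash integer arithmetic.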
Upvotes: 1