Reputation: 1071
I have a very large CSV file that looks something as follows:
# col1 col2 col3
1 1 7 9
2 2 8 10
3 3 9 11
4 4 10 12
5 5 11 13
6 6 12 14
For all columns, I would like to calculate the average of each consecutive two fields, then offset to the next two. For instance, in col1
average of 1
and 2
is the first cell of the resulting column, the average of 3
and 4
is the second cell of the resulting column. So, the new column size is half the original col1
.
The output of the script should look as follows for the provided sample file above:
# col1 col2 col3
1 1.5 7.5 9.5
2 3.5 9.5 11.5
3 5.5 11.5 13.5
This problem seems like a good [one] to be solved with AWK, but I'm still new to using AWK.
Any pointers are appreciated.
Upvotes: 2
Views: 781
Reputation: 1071
I took the liberty to generalize Jonathan Leffler's answer to cover the Nth
case for size of the average window and offset.
I wrote an awk
script (I called it avewithoffset
) as follows:
#!bin/awk
BEGIN{
FS=OFS="\t";
n=5; }
NR==1 { print; next;}
(NR-1)%n!=0 { for (i = 2; i <= NF; i++) old[i] += $i; }
(NR-1)%n==0 { for (i = 2; i <= NF; i++)
{ $i = ($i + old[i])/n; old[i] = 0; }
$1 = int( (NR-1)/n );
print; }
Notice that n=5
.
I fed the following file to it:
# col1 col2 col3
1 1 16 31
2 2 17 32
3 3 18 33
4 4 19 34
5 5 20 35
6 6 21 36
7 7 22 37
8 8 23 38
9 9 24 39
10 10 25 40
11 11 26 41
12 12 27 42
13 13 28 43
14 14 29 44
15 15 30 45
And the resulting file looks like:
# col1 col2 col3
1 3 18 33
2 8 23 38
3 13 28 43
Upvotes: 1
Reputation: 753970
It can be done with awk
.
awk 'BEGIN { OFS = "\t" }
NR ==1 { print; next } # Print header
NR%2==0 { for (i = 2; i <= NF; i++) old[i] = $i; }
NR%2==1 { for (i = 2; i <= NF; i++) $i = ($i + old[i])/2
$1 = (NR-1)/2; print }'
old
array.Sample output:
# col1 col2 col3
1 1.5 7.5 9.5
2 3.5 9.5 11.5
3 5.5 11.5 13.5
This script takes an argument which is the number of rows to group together, defaulting to 2 if no argument is specified. As outlined in a comment, the code needs to reset the old
array values to 0, and sum values rather than assign.
$ cat x.awk
awk -v N=${1:-2} \
'BEGIN { OFS = "\t" }
NR ==1 { print; next } # Print header
NR%N!=1 { for (i = 2; i <= NF; i++) old[i] += $i }
NR%N==1 { for (i = 2; i <= NF; i++) $i = ($i + old[i])/N
$1 = int((NR-1)/N)
print
for (i = 2; i <= NF; i++) old[i] = 0
}' data
$ cat data
# col1 col2 col3
1 1 7 9
2 2 8 10
3 3 9 11
4 4 10 12
5 5 11 13
6 6 12 14
7 7 14 17
8 8 16 19
9 9 18 22
10 10 20 26
11 11 22 28
12 12 24 29
$ bash x.awk 2
# col1 col2 col3
1 1.5 7.5 9.5
2 3.5 9.5 11.5
3 5.5 11.5 13.5
4 7.5 15 18
5 9.5 19 24
6 11.5 23 28.5
$ bash x.awk 3
# col1 col2 col3
1 2 8 10
2 5 11 13
3 8 16 19.3333
4 11 22 27.6667
$ bash x.awk 4
# col1 col2 col3
1 2.5 8.5 10.5
2 6.5 13.25 15.75
3 10.5 21 26.25
$ bash x.awk 6
# col1 col2 col3
1 3.5 9.5 11.5
2 9.5 19 23.5
$
If you want partial groups printed out at the end, add an appropriate END
block, which will need to divide by the number of partial rows rather than the number of rows.
Upvotes: 2