mbadawi23
mbadawi23

Reputation: 1071

How to calculate average of subsets of columns in csv?

I have a very large CSV file that looks something as follows:

#       col1    col2    col3
1       1       7       9
2       2       8       10
3       3       9       11
4       4       10      12
5       5       11      13
6       6       12      14

For all columns, I would like to calculate the average of each consecutive two fields, then offset to the next two. For instance, in col1 average of 1 and 2 is the first cell of the resulting column, the average of 3 and 4 is the second cell of the resulting column. So, the new column size is half the original col1.

The output of the script should look as follows for the provided sample file above:

#       col1    col2    col3
1       1.5     7.5     9.5
2       3.5     9.5     11.5
3       5.5     11.5    13.5

This problem seems like a good [one] to be solved with AWK, but I'm still new to using AWK.

Any pointers are appreciated.

Upvotes: 2

Views: 781

Answers (2)

mbadawi23
mbadawi23

Reputation: 1071

I took the liberty to generalize Jonathan Leffler's answer to cover the Nth case for size of the average window and offset.

I wrote an awk script (I called it avewithoffset) as follows:

#!bin/awk
BEGIN{
    FS=OFS="\t";
    n=5; }
NR==1 { print; next;}
(NR-1)%n!=0 { for (i = 2; i <= NF; i++) old[i] += $i; }
(NR-1)%n==0 { for (i = 2; i <= NF; i++)
              { $i = ($i + old[i])/n; old[i] = 0; }
              $1 = int( (NR-1)/n );
              print; }

Notice that n=5.

I fed the following file to it:

#   col1    col2    col3
1   1       16      31
2   2       17      32
3   3       18      33
4   4       19      34
5   5       20      35
6   6       21      36
7   7       22      37
8   8       23      38
9   9       24      39
10  10      25      40
11  11      26      41
12  12      27      42
13  13      28      43
14  14      29      44
15  15      30      45

And the resulting file looks like:

#   col1    col2    col3
1   3       18      33
2   8       23      38
3   13      28      43

Upvotes: 1

Jonathan Leffler
Jonathan Leffler

Reputation: 753970

It can be done with awk.

awk 'BEGIN   { OFS = "\t" }
     NR  ==1 { print; next } # Print header
     NR%2==0 { for (i = 2; i <= NF; i++) old[i] = $i; }
     NR%2==1 { for (i = 2; i <= NF; i++) $i = ($i + old[i])/2
               $1 = (NR-1)/2; print }'
  1. Set the output field separator to tab.
  2. Print the header line and skip to the next line.
  3. For the even lines, save the values in fields 2 to the end in the old array.
  4. For the odd lines (after the first), calculate the average of the old and current field values. Set the line number. Print the result.

Sample output:

#       col1    col2    col3
1       1.5     7.5     9.5
2       3.5     9.5     11.5
3       5.5     11.5    13.5

Generalizing for groups of N rows

This script takes an argument which is the number of rows to group together, defaulting to 2 if no argument is specified. As outlined in a comment, the code needs to reset the old array values to 0, and sum values rather than assign.

$ cat x.awk
awk -v N=${1:-2} \
    'BEGIN   { OFS = "\t" }
     NR  ==1 { print; next } # Print header
     NR%N!=1 { for (i = 2; i <= NF; i++) old[i] += $i }
     NR%N==1 { for (i = 2; i <= NF; i++) $i = ($i + old[i])/N
               $1 = int((NR-1)/N)
               print
               for (i = 2; i <= NF; i++) old[i] = 0
             }' data
$ cat data
#       col1    col2    col3
1       1       7       9
2       2       8       10
3       3       9       11
4       4       10      12
5       5       11      13
6       6       12      14
7       7       14      17
8       8       16      19
9       9       18      22
10      10      20      26
11      11      22      28
12      12      24      29
$ bash x.awk 2
#       col1    col2    col3
1       1.5     7.5     9.5
2       3.5     9.5     11.5
3       5.5     11.5    13.5
4       7.5     15      18
5       9.5     19      24
6       11.5    23      28.5
$ bash x.awk 3
#       col1    col2    col3
1       2       8       10
2       5       11      13
3       8       16      19.3333
4       11      22      27.6667
$ bash x.awk 4
#       col1    col2    col3
1       2.5     8.5     10.5
2       6.5     13.25   15.75
3       10.5    21      26.25
$ bash x.awk 6
#       col1    col2    col3
1       3.5     9.5     11.5
2       9.5     19      23.5
$

If you want partial groups printed out at the end, add an appropriate END block, which will need to divide by the number of partial rows rather than the number of rows.

Upvotes: 2

Related Questions