awk to combine and average matching fields

Question

I am trying to use awk to output data in the following format.

`$4` is `last # in $6 that matches $4` and maps to `$5` with an average depth of `average of $7 that matches $4`

input

chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    1   20
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    2   20
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    3   22
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  1   201
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  2   201
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  3   201
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  4   202

Desired output

chr1:955543-955763 is 3 bases and maps to AGRN-6|gc=75 with an average depth of 20.6
chr1:957571-957852 is 4 bases and maps to AGRN-7|gc=61.2 with an average depth of 201.3

I think this awk is close and hopefully a good start. Thank you :).

awk '
    {N[$4]++
     T[$4]+=$6
     M[$4]=$7
    }
END     {for (X in N) printf ("%s is %d bases and maps to %s with an average depth"\
                            " of %f reads\n", X, N[X], M[X], T[X]/N[X]);
    }
'  input.txt > output.txt

karakfa · Accepted Answer

This is a working prototype, without the formatting and the surrounding words:

$ awk '{k=$4 FS $5; a[k]+=$7; c[k]++} 
    END{for(k in a) 
          {split(k,ks,FS); 
           print ks[1],c[k],ks[2],a[k]/c[k]}}' file  

chr1:957571-957852 4 AGRN-7|gc=61.2 201.25
chr1:955543-955763 3 AGRN-6|gc=75 20.6667

Add the missing words and format the numbers with printf if that matters. awk does not preserve insertion order when iterating over an array with for (k in a), so the input order is lost, but there is a fix for that if you use gawk.
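Putting those pieces together, here is a minimal sketch rather than a definitive version. It makes a few choices of my own that are not in the answer: it joins the key with SUBSEP instead of FS, it keeps input order with a helper array (the names order and cnt are just for illustration) so it works in any POSIX awk rather than relying on a gawk-only feature, and it prints two decimals with %.2f. Note that printf rounds, so the first average comes out as 20.67 rather than the 20.6 shown in the question.

```shell
# Sample rows copied from the question (whitespace-separated fields).
cat > input.txt <<'EOF'
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    1   20
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    2   20
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    3   22
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  1   201
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  2   201
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  3   201
chr1    957571  957852  chr1:957571-957852  AGRN-7|gc=61.2  4   202
EOF

awk '
    {
        k = $4 SUBSEP $5                   # key: region ($4) plus gene ($5)
        if (!(k in n)) order[++cnt] = k    # remember keys in first-seen order
        sum[k] += $7                       # running total of depth
        n[k]++                             # number of rows (bases) for this key
    }
    END {
        for (i = 1; i <= cnt; i++) {       # walk keys in input order
            k = order[i]
            split(k, ks, SUBSEP)           # ks[1] = region, ks[2] = gene
            printf "%s is %d bases and maps to %s with an average depth of %.2f\n",
                   ks[1], n[k], ks[2], sum[k] / n[k]
        }
    }
' input.txt > output.txt

cat output.txt
```

Counting rows gives the same number as the last value of $6 only because $6 runs 1..N within each region in this data; if that does not hold for your file, store $6 per key instead. Change %.2f to another precision as needed.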
