Reputation: 3022
I am trying to use `awk` to output data in the following format.
`$4` is `last # in $6 that matches $4` bases and maps to `$5` with an average depth of `average of $7 that matches $4`
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 22
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 1 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 2 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 3 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 4 202
Desired output
chr1:955543-955763 is 3 bases and maps to AGRN-6|gc=75 with an average depth of 20.6
chr1:957571-957852 is 4 bases and maps to AGRN-7|gc=61.2 with an average depth of 201.3
I think this `awk` is close and hopefully a good start. Thank you :).
awk '
{N[$4]++
T[$4]+=$6
M[$4]=$7
}
END {for (X in N) printf ("%s is %d bases and maps to %s with an average depth"\
" of %f reads\n", X, N[X], M[X], T[X]/N[X]);
}
' input.txt > output.txt
Upvotes: 0
Views: 50
Reputation: 67507
This is a working prototype, without the number formatting and the surrounding words:
$ awk '{k=$4 FS $5; a[k]+=$7; c[k]++}
END{for(k in a)
{split(k,ks,FS);
print ks[1],c[k],ks[2],a[k]/c[k]}}' file
chr1:957571-957852 4 AGRN-7|gc=61.2 201.25
chr1:955543-955763 3 AGRN-6|gc=75 20.6667
Add the missing words and do the number formatting with `printf` if that matters. `awk` does not guarantee any order when iterating over an array with `for (k in a)`, so the input order is lost, but there is a fix for that if you use `gawk`.
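The missing words, the number formatting, and the ordering can be sketched roughly as follows. This is only one possible approach, not necessarily the `gawk` fix referred to above: it records each key the first time it is seen in a helper array, which also works in plain POSIX `awk`. The array names (`order`, `last`, `name`, `sum`, `n`) are made up for this illustration, and `%.2f` is an assumption, since the expected values in the question (20.6 and 201.3) do not follow from a single rounding rule.

```shell
# Sample data copied from the question.
cat > input.txt <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 22
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 1 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 2 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 3 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 4 202
EOF

awk '
{
    k = $4
    if (!(k in n)) order[++cnt] = k   # remember keys in first-seen order
    n[k]++                            # number of rows for this region
    sum[k] += $7                      # running total of the depth column
    last[k] = $6                      # last position seen in $6
    name[k] = $5                      # label the region maps to
}
END {
    # Walk the keys in input order instead of using for (k in n).
    for (i = 1; i <= cnt; i++) {
        k = order[i]
        printf "%s is %d bases and maps to %s with an average depth of %.2f\n",
               k, last[k], name[k], sum[k] / n[k]
    }
}' input.txt > output.txt

cat output.txt
```

With the sample input this prints the two regions in input order, with average depths of 20.67 and 201.25. If the number of rows is wanted rather than the last value of `$6`, print `n[k]` instead of `last[k]`; for this data both give 3 and 4.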
Upvotes: 1