justaguy

Reputation: 3022

awk array to output the line count as well as average

Thanks to @karakfa, the awk array script below produces the current output. I am trying to add $2 to the array and output that as well. $2 is basically the number of times the unique entry appears. As I am still learning awk arrays, I do not know if my attempt is close.

Input:

chr1:955542-955763  AGRN:exon.1 1   0
chr1:955542-955763  AGRN:exon.1 2   0
chr1:985542-985763  AGRN:exon.2 1   0
chr1:985542-985763  AGRN:exon.2 2   1

My script:

awk '{k=$1 OFS $2;
    l=$2;  # Is this correct?
    s[k]+=$4; c[k]++}
  END{for(i in s)  # Is this correct?
    print i, s[i]/c[i]},
      "(lbases)"  # Is this correct?' input

Current output:

chr1:955542-955763 AGRN:exon.1 0
chr1:985542-985763 AGRN:exon.2 0.5

Desired output:

chr1:955542-955763 AGRN:exon.1 0   (2 bases)
chr1:985542-985763 AGRN:exon.2 0.5 (2 bases)

Upvotes: 3

Views: 228

Answers (1)

tripleee

Reputation: 189317

Your attempt to introduce a new variable is not going to work. You need a count per array key, so the variable should be another array. But in this case, you don't need to add a new array, because the array c already contains the count per key.

awk '{k=$1 OFS $2;
    s[k]+=$4; c[k]++}
  END{for(i in s)
    print i, s[i]/c[i], "(" c[i] " bases)" }' input

Notice also that your attempt placed the "(lbases)" string outside the closing brace of the END block, so awk cannot treat it as part of the print statement.
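To try the approach against the sample rows from the question, here is a minimal self-contained sketch. The variable names `input` and `result` are only illustrative, and the trailing `sort` is an addition: awk does not guarantee the order in which `for (i in s)` visits keys, so sorting makes the output order predictable. The fields are joined with a single space, so the column padding shown in the desired output is not reproduced.

```shell
# Sample rows copied from the question (whitespace-separated fields).
input='chr1:955542-955763  AGRN:exon.1 1   0
chr1:955542-955763  AGRN:exon.1 2   0
chr1:985542-985763  AGRN:exon.2 1   0
chr1:985542-985763  AGRN:exon.2 2   1'

result=$(printf '%s\n' "$input" | awk '
  { k = $1 OFS $2      # key: interval plus exon name
    s[k] += $4         # running sum of column 4 for this key
    c[k]++ }           # number of lines seen for this key
  END { for (i in s)
          print i, s[i] / c[i], "(" c[i] " bases)" }' | sort)

printf '%s\n' "$result"
```

With the four sample rows this prints one line per key, with the average of column 4 followed by the line count in parentheses.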

This differs from the problem description in that the key is not $2, but the combination of $1 and $2. If you genuinely need the key to be solely $2, you do need a new array, but then the whole thing will get quite a bit more complex.
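As a rough sketch of one possible reading of the "$2 only" case: the key becomes $2 alone, and an extra array (here called `region`, a hypothetical name) remembers the first $1 seen for each key so it can still be printed. This assumes each $2 value belongs to a single $1; if that does not hold for your data, the extra array would need to collect several values and the script grows accordingly.

```shell
# Sample rows copied from the question (whitespace-separated fields).
input='chr1:955542-955763  AGRN:exon.1 1   0
chr1:955542-955763  AGRN:exon.1 2   0
chr1:985542-985763  AGRN:exon.2 1   0
chr1:985542-985763  AGRN:exon.2 2   1'

by_name=$(printf '%s\n' "$input" | awk '
  { n = $2                                  # key is the exon name only
    if (!(n in region)) region[n] = $1      # assumption: keep the first interval seen
    sum[n] += $4                            # running sum of column 4
    count[n]++ }                            # number of lines for this name
  END { for (n in sum)
          print region[n], n, sum[n] / count[n], "(" count[n] " bases)" }' | sort)

printf '%s\n' "$by_name"
```

For the sample input, where every exon name maps to exactly one interval, this gives the same two lines as keying on $1 and $2 together.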

Upvotes: 4
