justaguy

Reputation: 3022

awk to count and sum total using matching string from file

I am trying to use awk to get, for each distinct string in $5, the number of lines that contain it and the total length of those lines. The count is the number of lines that share the same $5, and the total length is the sum of $3 - $2 over those lines. Hopefully the awk below is a good start. Thank you :).

input

chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D

desired output

TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119

awk

awk '{count[$5]++}
END {
  for (word in count)
    print $1,$2,$3,$4,word, count[word]
}' input > count | 
awk 'print $1,$2,$3,$4,word, count[word]
}
{ $6 = $3 - $2 }
1' count.txt > length

edit

SCNN1D 1 119
GABRD 3 240
TAS1R3 4 223 

Upvotes: 1

Views: 620

Answers (3)

Ed Morton

Reputation: 203368

$ cat tst.awk
$5 != prev { if (NR>1) print prev, cnt, sum; prev=$5; cnt=sum=0 }
{ cnt++; sum+=($3-$2) }
END { print prev, cnt, sum }

$ awk -f tst.awk file
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119

Upvotes: 1

dawg

Reputation: 103814

You can do:

awk '{c1[$5]++; c2[$5]+=($3-$2)} 
     END{for (e in c1) print e, c1[e], c2[e]}' input

Note that the output order may differ from the order in the original file, because awk's for (e in c1) loop visits array indices in an unspecified order.
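If the original order matters, one option is to record each name the first time it appears and loop over that list in END. This is a minimal sketch, assuming whitespace-separated fields as in the question; the function name summarize_in_order and the variables order, total and n are made up for illustration:

```shell
# Hypothetical helper: per-name count and total length, in first-seen order.
summarize_in_order() {
  awk '
    !($5 in count) { order[++n] = $5 }       # remember each name the first time it appears
    { count[$5]++; total[$5] += $3 - $2 }    # line count and cumulative length per name
    END {
      for (i = 1; i <= n; i++)               # numeric loop keeps the recorded order
        print order[i], count[order[i]], total[order[i]]
    }' "$@"
}

# Sample input from the question, fed on stdin.
summarize_in_order <<'EOF'
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
EOF
```

Any arguments are forwarded to awk, so a file name can be passed instead of the heredoc. Unlike a prev-based approach, this does not require lines with the same $5 to be adjacent.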

Upvotes: 2

Henk Langeveld

Reputation: 8446

With awk, it's possible to do the entire thing in a single script, by keeping a running count of both the cumulative length, and the number of instances for each word.

Try this (as yet untested):

awk '{
  offset1=$2; offset2=$3; word=$5
  TotalLength[word]+=offset2 - offset1 # accumulate per word; or just += $3-$2
  count[word]++}
END {
  for (word in count)
    print word, count[word], TotalLength[word]
}' input

The original script had three errors.

  1. The second awk chunk had an ambiguous input specification: it is fed by a pipe and is also given a file argument (count.txt). When a file argument is present, awk reads the file and ignores the pipe, so the piped data is never used.
  2. In an END section, the numbered fields will only refer to the fields of the last line/record read. This is not what you want.
  3. Finally, the second awk script is missing the opening brace { for the print statement.
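Point 2 can be seen with a small sketch. The three-record input and the function name end_fields_demo are made up for illustration, and the behaviour shown assumes a POSIX-conforming awk, where the fields keep the values of the last record inside END:

```shell
# Hypothetical demo: numbered fields inside END come from the last record only.
end_fields_demo() {
  printf 'chr1 10 20 r1 GENE_A\nchr1 30 45 r2 GENE_A\nchr1 50 60 r3 GENE_B\n' |
    awk '{ count[$5]++ }
         END {
           # $2 and $3 here are those of the final input line, for every word
           for (word in count) print word, count[word], $2, $3
         }'
}

end_fields_demo
```

Every output line carries the same $2 and $3 (those of the last record), which is why per-word values have to be stored in arrays while the records are being read.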

Upvotes: 1
