Reputation: 3022
I am trying to get the count of each matching string in a file, and the total length for each, using awk. The string to match is in $5; the count is the number of lines containing that string, and the total length is the sum of $3 - $2 over those lines. Hopefully the awk below is a good start. Thank you :).
input
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
desired output
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119
awk
awk '{count[$5]++}
END {
for (word in count)
print $1,$2,$3,$4,word, count[word]
}' input > count |
awk 'print $1,$2,$3,$4,word, count[word]
}
{ $6 = $3 - $2 }
1' count.txt > length
edit
SCNN1D 1 119
GABRD 3 240
TAS1R3 4 223
Upvotes: 1
Views: 620
Reputation: 203368
$ cat tst.awk
$5 != prev { if (NR>1) print prev, cnt, sum; prev=$5; cnt=sum=0 }
{ cnt++; sum+=($3-$2) }
END { print prev, cnt, sum }
$ awk -f tst.awk file
TAS1R3 4 1555
GABRD 3 491
SCNN1D 1 119
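Because the script compares each line's $5 with the previous one, it relies on lines with the same name being adjacent, as they are in the sample. If that is not guaranteed, one option is to sort on the fifth field first. A minimal sketch, assuming the question's rows in an interleaved order (the file name shuffled.txt and the row ordering are made up for illustration):

```shell
# Sample rows from the question, deliberately interleaved
# (the file name shuffled.txt is hypothetical).
cat > shuffled.txt <<'EOF'
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
EOF

# Group equal names together, then apply the same prev/cnt/sum logic.
out=$(sort -k5,5 shuffled.txt | awk '
$5 != prev { if (NR>1) print prev, cnt, sum; prev=$5; cnt=sum=0 }
{ cnt++; sum+=($3-$2) }
END { print prev, cnt, sum }')
printf '%s\n' "$out"
```

The counts and sums are the same as above, but the lines come out in sort order of the name rather than in file order.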
Upvotes: 1
Reputation: 103814
You can do:
awk '{c1[$5]++; c2[$5]+=($3-$2)}
END{for (e in c1) print e, c1[e], c2[e]}' input
Note that for (e in c1) visits the keys in an unspecified order, so the order of the output records may differ from the order in the original file.
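If the original order matters, one option is to record each name the first time it appears and loop over that list in END. A minimal sketch, assuming the question's sample data is saved as input; the array name order and the counter n are illustrative additions, not part of the command above:

```shell
# Recreate the sample data from the question in a file named 'input'.
cat > input <<'EOF'
chr1 1266716 1266926 chr1:1266716-1266926 TAS1R3
chr1 1267008 1267328 chr1:1267008-1267328 TAS1R3
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3
chr1 1268291 1268514 chr1:1268291-1268514 TAS1R3
chr1 1956371 1956503 chr1:1956371-1956503 GABRD
chr1 1956747 1956866 chr1:1956747-1956866 GABRD
chr1 1956947 1957187 chr1:1956947-1957187 GABRD
chr1 1220077 1220196 chr1:1220077-1220196 SCNN1D
EOF

out=$(awk '
!($5 in c1) { order[++n] = $5 }    # remember each name in first-seen order
{ c1[$5]++; c2[$5] += $3 - $2 }     # per-name count and total length
END { for (i = 1; i <= n; i++) print order[i], c1[order[i]], c2[order[i]] }
' input)
printf '%s\n' "$out"
```

The first rule has to come before the counting rule, since it tests whether the name has already been seen in c1.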
Upvotes: 2
Reputation: 8446
With awk, it's possible to do the entire thing in a single script, by keeping a running count of both the cumulative length, and the number of instances for each word.
Try this (untested):
awk '{
offset1=$2; offset2=$3; word=$5
TotalLength[word] += offset2 - offset1 # accumulate per word; same as $3-$2
count[word]++}
END {
for (word in count)
print word, count[word], TotalLength[word]
}' input
The original script had three errors.
1. The second awk chunk had an ambiguous input specification: it reads from a pipe and is also given a file argument (count.txt). In this case, awk reads the file argument and ignores the piped input.
2. In the END section, the numbered fields will only refer to the fields of the last line/record read. This is not what you want.
3. The opening { for the print statement is missing.
Upvotes: 1