Reputation: 3022
I am trying to count the unique entries in $2 (only the part before the |) in a file:
file
chr11:101323676-101323847 TRPC6|gc=39.2 143.1
chr11:101324359-101324478 TRPC6|gc=38.7 146.4
chr11:101325731-101325850 TRPC6|gc=32.8 84.5
chr11:101341904-101342127 TRPC6|gc=43.5 197.9
chr12:5153304-5155165 KCNA5|gc=65.1 633.7
chr12:52306230-52306349 ACVRL1|gc=58.8 152.4
chr12:52306868-52307149 ACVRL1|gc=66.5 309.6
chr12:52307328-52307569 ACVRL1|gc=66.8 305.9
chr12:52307743-52307872 ACVRL1|gc=64.3 267.1
desired output
3
Tried:
awk '{sub(/:.*/,"",$2)} !seen[$2]++{unq++} END{print unq}' file.txt
Currently, I am getting a very different number and think it is because I need to split on the |, but I am not sure of the correct way to do so. Thank you :).
Upvotes: 2
Views: 74
Reputation: 67507
awk to the rescue!
$ awk '{split($2,a,"|"); c[a[1]]}
END{for(k in c) count++; print count}' file
3
or a shorter version:
$ awk '{split($2,a,"|"); if(!c[a[1]]++) count++}
END{print count}' file
or shortest:
$ awk 'split($2,a,"|") && !c[a[1]]++{u++} END{print u}' file
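For anyone who wants to try the split-based approach end to end, here is a minimal sketch that recreates the sample data from the question in a temporary file. The `sample` variable and the `count_genes` function name are made up for illustration and are not part of the answer above; the awk body is the same idea as the shorter version, with `n + 0` added so that empty input prints 0 instead of an empty line.

```shell
# Recreate the sample data from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr11:101323676-101323847 TRPC6|gc=39.2 143.1
chr11:101324359-101324478 TRPC6|gc=38.7 146.4
chr11:101325731-101325850 TRPC6|gc=32.8 84.5
chr11:101341904-101342127 TRPC6|gc=43.5 197.9
chr12:5153304-5155165 KCNA5|gc=65.1 633.7
chr12:52306230-52306349 ACVRL1|gc=58.8 152.4
chr12:52306868-52307149 ACVRL1|gc=66.5 309.6
chr12:52307328-52307569 ACVRL1|gc=66.8 305.9
chr12:52307743-52307872 ACVRL1|gc=64.3 267.1
EOF

# count_genes (hypothetical helper): split $2 on "|" and count how many
# distinct values appear in front of the "|".
count_genes() {
    awk '{split($2, a, "|"); if (!seen[a[1]]++) n++} END {print n + 0}' "$1"
}

count_genes "$sample"   # prints 3 for the sample data
```

The three distinct names here are TRPC6, KCNA5, and ACVRL1.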
Upvotes: 4
Reputation: 158100
You were almost there. You simply need to replace : by \| in the regex used in sub():
awk '{sub(/\|.*/,"",$2)}!seen[$2]++{c++}END{print c}' file
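To see why the pattern from the question had no effect, it can help to run both substitutions on a single line of the sample data. This is only a small sketch; the variable names `line`, `before`, and `after` are made up for illustration.

```shell
# One line of the sample data from the question.
line='chr11:101323676-101323847 TRPC6|gc=39.2 143.1'

# Pattern from the question: $2 contains no ":", so sub() changes nothing.
before=$(printf '%s\n' "$line" | awk '{sub(/:.*/, "", $2); print $2}')

# Corrected pattern: everything from the "|" onwards is removed.
after=$(printf '%s\n' "$line" | awk '{sub(/\|.*/, "", $2); print $2}')

printf '%s\n' "$before"   # prints TRPC6|gc=39.2
printf '%s\n' "$after"    # prints TRPC6
```

Because the gc value stays attached to $2 with the original pattern, rows for the same name with different gc values are counted as separate entries.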
You can also play with the delimiter like this:
awk -F'[|]| +' '!seen[$2]++{c++}END{print c}' file
I'm using either | or one or more spaces as the delimiter. This makes it possible to access the part of interest as $2.
The remaining part follows the same logic as the example in your question: we use $2 as the index into the lookup table seen and check whether this index has appeared before. If not, we increment the counter c, and at the end we print c.
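As a quick check of how the custom delimiter splits a record, the following sketch prints every field of one sample line together with its index. The `line` and `fields` variable names are only illustrative.

```shell
# One line of the sample data from the question.
line='chr12:52306230-52306349 ACVRL1|gc=58.8 152.4'

# Split on either "|" or a run of spaces and list each field with its number.
fields=$(printf '%s\n' "$line" |
    awk -F'[|]| +' '{for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i}')

printf '%s\n' "$fields"
# prints:
# 1=chr12:52306230-52306349
# 2=ACVRL1
# 3=gc=58.8
# 4=152.4
```

With this delimiter the name lands in $2 on its own, so it can be used directly as the key in seen without calling sub() or split().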
Upvotes: 3