justaguy

Reputation: 3022

count by field in awk before pipe symbol

I am trying to count the unique entries in $2 of a file, using only the part of the field before the |:

file

chr11:101323676-101323847   TRPC6|gc=39.2   143.1
chr11:101324359-101324478   TRPC6|gc=38.7   146.4
chr11:101325731-101325850   TRPC6|gc=32.8   84.5
chr11:101341904-101342127   TRPC6|gc=43.5   197.9
chr12:5153304-5155165   KCNA5|gc=65.1   633.7
chr12:52306230-52306349 ACVRL1|gc=58.8  152.4
chr12:52306868-52307149 ACVRL1|gc=66.5  309.6
chr12:52307328-52307569 ACVRL1|gc=66.8  305.9
chr12:52307743-52307872 ACVRL1|gc=64.3  267.1

desired output

3

Tried:

awk '{sub(/:.*/,"",$2)} !seen[$2]++{unq++} END{print unq}' file.txt

Currently, I am getting a very different number and think it is because I need to split on the |, but I am not sure of the correct way to do so. Thank you :).

Upvotes: 2

Views: 74

Answers (2)

karakfa

Reputation: 67507

awk to the rescue!

$ awk '{split($2,a,"|"); c[a[1]]} 
    END{for(k in c) count++; print count}' file

3

or shorter version

$ awk '{split($2,a,"|"); if(!c[a[1]]++) count++} 
    END{print count}' file

shortest

$ awk 'split($2,a,"|") && !c[a[1]]++{u++} END{print u}' file

Upvotes: 4

hek2mgl

Reputation: 158100

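You were almost there. $2 contains no :, so your sub() never matches and the whole field (gene name plus gc value) is used as the key. You simply need to replace : with \| in the regex used in sub():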

awk '{sub(/\|.*/,"",$2)}!seen[$2]++{c++}END{print c}' file
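
To see the difference between the two regexes, here is a small sketch that runs both against the sample rows from the question. The temporary file and the variable names (sample, before, after) are just choices made for this illustration, not part of the original commands.

```shell
# Scratch copy of the sample rows from the question (temp file is an assumption of this sketch).
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr11:101323676-101323847   TRPC6|gc=39.2   143.1
chr11:101324359-101324478   TRPC6|gc=38.7   146.4
chr11:101325731-101325850   TRPC6|gc=32.8   84.5
chr11:101341904-101342127   TRPC6|gc=43.5   197.9
chr12:5153304-5155165   KCNA5|gc=65.1   633.7
chr12:52306230-52306349 ACVRL1|gc=58.8  152.4
chr12:52306868-52307149 ACVRL1|gc=66.5  309.6
chr12:52307328-52307569 ACVRL1|gc=66.8  305.9
chr12:52307743-52307872 ACVRL1|gc=64.3  267.1
EOF

# Original attempt: $2 has no ":", so sub() changes nothing and every
# "GENE|gc=..." string is treated as its own key.
before=$(awk '{sub(/:.*/,"",$2)} !seen[$2]++{unq++} END{print unq}' "$sample")

# Corrected regex: strip everything from the "|" onward, leaving only the gene name.
after=$(awk '{sub(/\|.*/,"",$2)} !seen[$2]++{c++} END{print c}' "$sample")

echo "before: $before, after: $after"
rm -f "$sample"
```

On this sample every gc value is different, so the original command counts all 9 rows, while the corrected one gives 3.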

You can also play with the delimiter like this:

awk -F'[|]| +' '!seen[$2]++{c++}END{print c}' file

I'm using either | or one or more spaces as the delimiter. This makes it possible to access the part of interest as $2.

The remaining part follows the same logic as the example in your question: We use $2 as index in the lookup table seen and check if this index has appeared before. If not, we increment the counter c and at the end we print c.
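
If it helps to check the result, the same lookup-table idea can also report how many rows each key has. This is only a sketch on the sample rows from the question: the temporary file, the array name n and the shell variables are names picked for the illustration, and it assumes the columns are separated by spaces as shown (a tab-separated file would need a different -F).

```shell
# Scratch copy of the sample rows from the question (temp file is an assumption of this sketch).
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr11:101323676-101323847   TRPC6|gc=39.2   143.1
chr11:101324359-101324478   TRPC6|gc=38.7   146.4
chr11:101325731-101325850   TRPC6|gc=32.8   84.5
chr11:101341904-101342127   TRPC6|gc=43.5   197.9
chr12:5153304-5155165   KCNA5|gc=65.1   633.7
chr12:52306230-52306349 ACVRL1|gc=58.8  152.4
chr12:52306868-52307149 ACVRL1|gc=66.5  309.6
chr12:52307328-52307569 ACVRL1|gc=66.8  305.9
chr12:52307743-52307872 ACVRL1|gc=64.3  267.1
EOF

# Split on "|" or a run of spaces so that $2 is the bare gene name,
# then tally rows per gene. "for (g in n)" has no guaranteed order, hence the sort.
per_gene=$(awk -F'[|]| +' '{n[$2]++} END{for (g in n) print g, n[g]}' "$sample" | sort)

# The distinct count, computed the same way as in the answer.
distinct=$(awk -F'[|]| +' '!seen[$2]++{c++} END{print c}' "$sample")

printf '%s\n' "$per_gene"
echo "distinct: $distinct"
rm -f "$sample"
```

The number of lines in the per-gene listing should match the distinct count, which makes it easy to spot a key that was split incorrectly.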

Upvotes: 3
