Reputation: 1061
I have
A 34 missense fixed
A 33 synonymous fixed
B 12 synonymous var
B 34 missense fixed
B 34 UTR fixed
B 45 missense var
TRI 4 synonymous var
TRI 4 intronic var
3 3 synonymous fixed
I wanna output the counts of the combinations missense && fixed, missense && var, synonymous && fixed, synonymous && var , for each element in $1
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 0
TRI 0 0 0 1
3 0 0 1 0
I can do this way with 4 individual commands selecting for each combination and concatenating the outputs
awk -F'\t' '($3~/missense/ && $4~/fixed/)' file | awk -F'\t' '{count[$1"\t"$3"\t"$4]++} END {for (word in count) print word"\t"count[word]}' > out
But I'm would like to do this for all combinations at once. I've tried some variations of this but not able to make it work
awk print a[i] -v delim=":" -v string='missense:synonymous:fixed:var' 'BEGIN {n = split(string, a, delim); for (i = 1; i <= n-2; ++i) {count[xxxx}++}} END ;for (word in count) print word"\t"count[word]}
Upvotes: 1
Views: 66
Reputation: 203684
Using any awk in any shell on every Unix box with an assist from column
to convert the tab-separated awk output to a visually tabular display if you want it:
$ cat tst.awk
BEGIN {
OFS = "\t"
numTags = split("missensefixed missensevar synonymousfixed synonymousvar",tags)
}
{
keys[$1]
cnt[$1,$3 $4]++
}
END {
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
printf "%s%s", OFS, tag
}
print ""
for (key in keys) {
printf "%s", key
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
val = cnt[key,tag]
printf "%s%d", OFS, val
}
print ""
}
}
$ awk -f tst.awk file
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
$ awk -f tst.awk file | column -s$'\t' -t
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I'd highly recommend you always give every column a header string though so it doesn't make further processing of the data harder (e.g. reading it into Excel and sorting on headers), so if I were you I'd add printf "key"
or something else that more accurately identifies that columns contents as the first line of the END section (i.e. on a line immediately before the first for
loop) so the first column gets a header too:
$ awk -f tst.awk file | column -s$'\t' -t
key missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
Upvotes: 1
Reputation: 29222
GNU awk
supports arrays of arrays, so if it is your awk
you can count your records with something as simple as num[$1][$3$4]++
. The most complex part is the final human-friendly printing:
$ cat foo.awk
{ num[$1][$3$4]++ }
END {
printf(" missensefixed missensevar synonymousfixed synonymousvar\n");
for(r in num) printf("%3s%14d%12d%16d%14d\n", r, num[r]["missensefixed"],
num[r]["missensevar"], num[r]["synonymousfixed"], num[r]["synonymousvar"])}
$ awk -f foo.awk data.txt
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
Upvotes: 2
Reputation: 785276
You may use this awk
with multiple arrays to hold different counts:
awk -v OFS='\t' '
{keys[$1]}
/missense fixed/ {++mf[$1]}
/missense var/ {++mv[$1]}
/synonymous fixed/ {++sf[$1]}
/synonymous var/ {++sv[$1]}
END {
print "-\tmissensefixed\tmissensevar\tsynonymousfixed\tsynonymousvar"
for (i in keys)
print i, mf[i]+0, mv[i]+0, sf[i]+0, sv[i]+0
}
' file | column -t
- missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I have used column -t
for tabular output only.
Upvotes: 2