Madza Farias-Virgens
Madza Farias-Virgens

Reputation: 1061

Looping through combinations of selected strings in specific columns and counting their occurrence

I have

A 34 missense fixed
A 33 synonymous fixed
B 12 synonymous var
B 34 missense fixed
B 34 UTR fixed
B 45 missense var
TRI 4 synonymous var
TRI 4 intronic var
3 3 synonymous fixed

I wanna output the counts of the combinations missense && fixed, missense && var, synonymous && fixed, synonymous && var , for each element in $1

   missensefixed  missensevar synonymousfixed synonymousvar
A  1              0           1               0
B  1              1           0               0  
TRI 0             0           0               1
3  0              0           1               0

I can do this way with 4 individual commands selecting for each combination and concatenating the outputs

awk  -F'\t' '($3~/missense/ && $4~/fixed/)' file | awk -F'\t' '{count[$1"\t"$3"\t"$4]++} END {for (word in count) print word"\t"count[word]}' > out

But I'm would like to do this for all combinations at once. I've tried some variations of this but not able to make it work

 awk print a[i] -v delim=":" -v string='missense:synonymous:fixed:var'  'BEGIN {n = split(string, a, delim); for (i = 1; i <= n-2; ++i) {count[xxxx}++}} END ;for (word in count) print word"\t"count[word]}

Upvotes: 1

Views: 66

Answers (3)

Ed Morton
Ed Morton

Reputation: 203684

Using any awk in any shell on every Unix box with an assist from column to convert the tab-separated awk output to a visually tabular display if you want it:

$ cat tst.awk
BEGIN {
    OFS = "\t"
    numTags = split("missensefixed  missensevar synonymousfixed synonymousvar",tags)
}
{
    keys[$1]
    cnt[$1,$3 $4]++
}
END {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", OFS, tag
    }
    print ""
    for (key in keys) {
        printf "%s", key
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            val = cnt[key,tag]
            printf "%s%d", OFS, val
        }
        print ""
    }
}

$ awk -f tst.awk file
        missensefixed   missensevar     synonymousfixed synonymousvar
A       1       0       1       0
B       1       1       0       1
TRI     0       0       0       1
3       0       0       1       0

$ awk -f tst.awk file | column -s$'\t' -t
     missensefixed  missensevar  synonymousfixed  synonymousvar
A    1              0            1                0
B    1              1            0                1
TRI  0              0            0                1
3    0              0            1                0

I'd highly recommend you always give every column a header string though so it doesn't make further processing of the data harder (e.g. reading it into Excel and sorting on headers), so if I were you I'd add printf "key" or something else that more accurately identifies that columns contents as the first line of the END section (i.e. on a line immediately before the first for loop) so the first column gets a header too:

$ awk -f tst.awk file | column -s$'\t' -t
key  missensefixed  missensevar  synonymousfixed  synonymousvar
A    1              0            1                0
B    1              1            0                1
TRI  0              0            0                1
3    0              0            1                0

Upvotes: 1

Renaud Pacalet
Renaud Pacalet

Reputation: 29222

GNU awk supports arrays of arrays, so if it is your awk you can count your records with something as simple as num[$1][$3$4]++. The most complex part is the final human-friendly printing:

$ cat foo.awk
{ num[$1][$3$4]++ }
END {
  printf("    missensefixed missensevar synonymousfixed synonymousvar\n");
  for(r in num) printf("%3s%14d%12d%16d%14d\n", r, num[r]["missensefixed"],
    num[r]["missensevar"], num[r]["synonymousfixed"], num[r]["synonymousvar"])}

$ awk -f foo.awk data.txt
    missensefixed missensevar synonymousfixed synonymousvar
  A             1           0               1             0
  B             1           1               0             1
TRI             0           0               0             1
  3             0           0               1             0

Upvotes: 2

anubhava
anubhava

Reputation: 785276

You may use this awk with multiple arrays to hold different counts:

awk -v OFS='\t' '
{keys[$1]}
/missense fixed/ {++mf[$1]}
/missense var/ {++mv[$1]}
/synonymous fixed/ {++sf[$1]}
/synonymous var/ {++sv[$1]}
END {
   print "-\tmissensefixed\tmissensevar\tsynonymousfixed\tsynonymousvar"
   for (i in keys)
      print i, mf[i]+0, mv[i]+0, sf[i]+0, sv[i]+0
}
' file | column -t

-    missensefixed  missensevar  synonymousfixed  synonymousvar
A    1              0            1                0
B    1              1            0                1
TRI  0              0            0                1
3    0              0            1                0

I have used column -t for tabular output only.

Upvotes: 2

Related Questions