Madza Farias-Virgens

Reputation: 1071

Counting unique occurrences in each column

I have a file with several columns ($2, $3, and so on up to $32), as in

A refdevhet devdevhomo
B refdevhet refdevhet
C refrefhomo refdevhet
D devrefhet  refdevhet

I need to count the occurrences of each unique element in each column separately,

so that I have

refdevhet  2 3
refrefhomo 1 0
devrefhet  1 0
devdevhomo 0 1

I tried several variations of

awk 'BEGIN {
  FS=OFS="\t"
}

{
  for(i=1; i<=32; i++) a[$i]++
}

END {
  for (i in a) print i, a[i]
}' file

but instead it prints the cumulative count of each unique element summed across all the selected fields, not per column.
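The cause of that behavior: `a[$i]++` keys the array on the field value alone, so counts from every column land in the same bucket. A minimal sketch of the merged result on the sample data (restricted to `i<=NF` so the empty fields `$4`..`$32` are not counted under the empty-string key; the file is assumed tab-separated):

```shell
# Recreate the sample input from the question, tab-separated.
printf 'A\trefdevhet\tdevdevhomo\nB\trefdevhet\trefdevhet\nC\trefrefhomo\trefdevhet\nD\tdevrefhet\trefdevhet\n' > file

# a[$i]++ is keyed on the value only, so the per-column counts merge:
# refdevhet appears 2 times in column 2 and 3 times in column 3,
# but comes out as a single count of 5.
merged=$(awk 'BEGIN{FS=OFS="\t"} {for(i=2;i<=NF;i++) a[$i]++} END{for(k in a) print k, a[k]}' file | sort)
echo "$merged"
```

This is why the fix in both answers below keys the counter on the (value, column) pair instead of the value alone.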

Upvotes: 5

Views: 136

Answers (2)

glenn jackman

Reputation: 247230

In addition to @Andriy's good answer, with GNU awk you can use a two-dimensional array:

gawk '
  {for (i=2; i<=NF; i++) count[$i][i]++}
  END {
    for (word in count) {
      printf "%s", word
      for (i=2; i<=NF; i++) printf "%s%d", OFS, count[word][i]
      print ""
    }
  }
' file | column -t

I'm assuming here that each line has the same number of fields as the last line.

Upvotes: 4

Andriy Makukha

Reputation: 8344

Here is a solution:

BEGIN {
    FS = OFS = "\t"
}
{
    if (NF > mxf) mxf = NF
    for (i = 1; i <= NF; i++) { ws[$i] = 1; c[$i, i]++ }
}
END {
    for (w in ws) {
        printf "%s", w
        for (i = 1; i <= mxf; i++) printf "%s%d", OFS, c[w, i]
        print ""
    }
}

Note that this solution is general: it works with any POSIX awk (the `c[$i, i]` subscripts use SUBSEP-joined keys rather than gawk's arrays of arrays), and it takes the first column into consideration as well. To omit the first column, change i=1 to i=2 in both places.
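For instance, with the script saved as `count.awk` (a hypothetical filename) and i=2 in both places to skip the label column, a run on the sample data can be sketched as:

```shell
# The script above with i=2 in both places, so the first (label) column is skipped.
cat > count.awk <<'EOF'
BEGIN { FS = OFS = "\t" }
{
    if (NF > mxf) mxf = NF
    for (i = 2; i <= NF; i++) { ws[$i] = 1; c[$i, i]++ }
}
END {
    for (w in ws) {
        printf "%s", w
        for (i = 2; i <= mxf; i++) printf "%s%d", OFS, c[w, i]
        print ""
    }
}
EOF

# Sample input from the question, tab-separated.
printf 'A\trefdevhet\tdevdevhomo\nB\trefdevhet\trefdevhet\nC\trefrefhomo\trefdevhet\nD\tdevrefhet\trefdevhet\n' > file

# sort makes the arbitrary "for (w in ws)" order deterministic for display.
result=$(awk -f count.awk file | sort)
echo "$result" | column -t
```

Unlike the gawk answer, this tracks the maximum field count `mxf` explicitly, so lines with differing numbers of fields are handled without assuming the last line is the widest.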

Upvotes: 6
