user971102

Reputation: 3085

Print a count of unique values for each column in many files

I have many files containing huge binary matrices with many columns, and I am trying to get a count of zeroes and ones in each column, for each file, while keeping track of the file names and the headers. Each file has the same headers and number of columns (but a variable number of rows), like this:

File 1:
Header1 Header2 Header3 Header4
0 1 0 1 
0 1 0 1
1 0 0 1
0 1 0 1

File 2:
Header1 Header2 Header3 Header4
0 1 0 0 
0 0 0 0
0 0 0 1

Desired output (count of 1s per column, plus a Total giving the number of data rows):

      Header1 Header2 Header3 Header4 Total
File1 1       3       0       4       4
File2 0       1       0       1       3
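
To reproduce, the two sample files can be created like this (any equivalent method is fine):

printf '%s\n' 'Header1 Header2 Header3 Header4' \
    '0 1 0 1' '0 1 0 1' '1 0 0 1' '0 1 0 1' > file1
printf '%s\n' 'Header1 Header2 Header3 Header4' \
    '0 1 0 0' '0 0 0 0' '0 0 0 1' > file2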

For now I have a count of the values equal to one, but only for file1, and it prints each header on its own row instead of keeping the original headers as column headings. It also prints nothing (rather than 0) when a column contains no ones, and it doesn't include the file name, so it is not right overall! Can you please direct me to the correct way to do this?

awk 'NF>0{
  for (i=1; i<=NF; i++)
      if (NR==1) h[i]=$i; else if ($i==1) a[i]++
} END{
  for (i=1; i<=length(a); i++) print h[i], a[i], NR
}' file1

Upvotes: 4

Views: 148

Answers (3)

Ed Morton

Reputation: 204731

It's MUCH simpler than you think. With GNU awk (which you are already using since you used the gawk extension length(array) in your code) for ENDFILE:

$ cat tst.awk
BEGIN { OFS="\t" }
NR==1 { print "", $0, "Total" }     # header line of the first file
FNR>1 {                             # every data line of every file
    for (i=1; i<=NF; i++) {
        cnt[i,$i]++                 # per-column count of each value
    }
}
ENDFILE {                           # gawk-only: runs at the end of each file
    printf "%s%s", FILENAME, OFS
    for (i=1; i<=NF; i++) {
        printf "%d%s", cnt[i,1], OFS    # %d prints 0 if cnt[i,1] is unset
    }
    print FNR-1                     # data rows = lines in file minus header
    delete cnt
}

$ awk -f tst.awk file1 file2
        Header1 Header2 Header3 Header4 Total
file1   1       3       0       4       4
file2   0       1       0       1       3

The above only stores a tiny amount of data in an array (the count of values per field for one file at a time), so it uses minimal memory, and it does very few operations per input line, so it should run extremely fast.
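
One detail worth calling out: printing cnt[i,1] with "%d" is what guarantees a 0 (rather than nothing) for a column that contains no 1s, since an array element that was never incremented is just an empty string and %d coerces it to 0. A quick demonstration:

$ awk 'BEGIN{ print "[" x "]"; printf "[%d]\n", x }'
[]
[0]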

As @ghoti points out you may not be using gawk after all, so here's a version that avoids the gawk-only ENDFILE and keeps an explicit row counter instead; it should work in any modern awk:

$ cat tst.awk
BEGIN { OFS="\t" }
NR==1 { print "", $0, "Total" }
FNR==1 { prt(); next }              # file boundary: report the previous file
{
    for (i=1; i<=NF; i++) {
        cnt[i,$i]++
    }
    numRows++
}
END { prt() }                       # report the last file

function prt() {
    if (prevFilename != "") {
        printf "%s%s", prevFilename, OFS
        for (i=1; i<=NF; i++) {
            printf "%d%s", cnt[i,1], OFS
        }
        print numRows               # data rows counted for the previous file
        numRows = 0
        delete cnt
    }
    prevFilename = FILENAME
}

$ awk -f tst.awk file1 file2
        Header1 Header2 Header3 Header4 Total
file1   1       3       0       4       4
file2   0       1       0       1       3

Upvotes: 3

ghoti

Reputation: 46896

The following seems to work for me:

awk '
  # Gather headers, only from the first line of the first file.
  NR==1{
    for(i=1;i<=NF;i++){
      h[i]=$i;
    }
  }
  # Do not process header as if they were data.
  FNR==1{ next; }

  NF>limit{ limit=NF; }

  # Step through data 
  {
    f[FILENAME]++;
    for(i=1;i<=NF;i++){
      a[FILENAME,i]+=$i;
    }
  }

  # Display what we found.
  END{
    # Headers...
    printf("File\t");
    for(i=1;i<=limit;i++){  # limit avoids the gawk-only length(array)
      printf("%s\t",h[i])
    }
    print "Total";

    # And data.
    for(file in f){
      printf("%s",file);
      for(i=1;i<=limit;i++){
        printf("\t%d",a[file,i])
      }
      printf("\t%d\n",f[file]);
    }
  }' file1 file2

Note that we keep the array f[] keyed by filename: it counts the data rows in each file and also lets us loop over the filenames in END, since awk doesn't really support multidimensional arrays (a[FILENAME,i] is just a flattened pseudo-2D array). The script above should work in just about any old awk. (I tested it in FreeBSD.) You might run into a problem if you're processing millions of files, though, as the arrays use a non-zero amount of memory. On the other hand, the number of files is also limited by your shell's command line length. :-)
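
If you ever do hit that command-line limit, you can stream the file names in instead; a sketch, assuming the script above has been saved as counts.awk (a hypothetical name):

# beware: for a very long list, xargs may run awk more than once,
# which would repeat the header and split the report across batches
find . -name 'file*' -print0 | xargs -0 awk -f counts.awk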

The NF>limit bookkeeping is there as a safety net in case the header line and the data rows don't have the same number of fields. Perhaps this is sufficient for you to take it the rest of the way.

Upvotes: 4

Simon

Reputation: 10841

Assuming that each file really does have as many headers as it has data columns (as in the samples above), the following code works for me:

#!/bin/sh
awk '
    function pr(filename) {
        if (filename) printf ("%s", filename)
        for (i=1; i<=NF; i++) {
            if (filename)
                printf ("%s%s", OFS, a[i])
            else
                printf ("%s%s", OFS, $i)
            a[i] = 0
        }
        if (filename)
            printf ("%s%s", OFS, prevFNR-1)
        else
            printf ("%sTotal", OFS)
        printf ("\n")
    }

    FNR==1 {
        pr(prevFileName)
        prevFileName = FILENAME
        next
    }

    NF>0 {
        for (i=1; i<=NF; i++)
            if ($i==1) a[i]++
        prevFNR = FNR
    }

    END {
        pr(FILENAME)
    }' file1 file2

There is no way in portable awk to know that the program has reached the last line of a given file, but FNR==1 is true at the start of the next file, so I've used that to trigger printing each summary line. For that reason, the program uses prevFNR and prevFileName to remember the number of records in the previous file and the name of that file. The printing code is called from two different places, so I've put it in a function, pr(), which uses the fact that prevFileName has no value the first time FNR==1 is true to decide whether it should print the header line or the computed summary for a file.

The output is:

 Header1 Header2 Header3 Header4 Total
file1 1 3 0 4 4
file2 0 1 0 1 3
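
Incidentally, the FNR==1 trick is easy to see in isolation: FNR resets at each file boundary while NR keeps counting. With the sample files from the question:

$ awk '{ print FILENAME, NR, FNR }' file1 file2
file1 1 1
file1 2 2
file1 3 3
file1 4 4
file1 5 5
file2 6 1
file2 7 2
file2 8 3
file2 9 4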

Upvotes: 4
