chas
chas

Reputation: 1655

How to count and print the occurence of a column in multiple files using awk

I have multiple tab delim text files in the format shown below:

 File1.txt
 Sample    Ind    Start    col1    col2    col3
  ID1       1      1       f       g       f
  ID1       1      2       f       g       f
  ID1       1      3       f       g       f
  ID1       1      5       f       g       f

File2.txt
 Sample    Ind    Start    col1    col2    col3
  ID2       1      1       f       g       f
  ID2       1      2       f       g       f
  ID2       1      4       f       g       f
  ID2       1      5       f       g       f
  ID2       1      6       f       g       f

 File3.txt
 Sample    Ind    Start    col1    col2    col3
  ID3       1      1       f       g       f
  ID3       1      3       f       g       f
  ID3       1      7       f       g       f

I would like to count the number of times the columns 2 and 3 i.e. columns ' Ind' and 'Start' found in total number of files and print the count and Sample to a new column.

The output should look like:

  Sample    Ind    Start    col1    col2    col3    Count    Samples
  ID1       1      1       f       g       f         3    ID1,ID2,ID3
  ID1       1      2       f       g       f         2    ID1,ID2
  ID1       1      3       f       g       f         2    ID1,ID3
  ID1       1      5       f       g       f         2    ID1,ID2
  ID2       1      6       f       g       f         1    ID2
  ID3       1      7       f       g       f         1    ID3

This is a toy example with 3 files, and in real scenario it should work with 'n' files in the directory. Could someone provide an awk solution to do this.

Upvotes: 1

Views: 150

Answers (1)

Jonathan Leffler
Jonathan Leffler

Reputation: 755114

Superficially, this would do the job:

awk 'NR == 1  { OFS="\t"; print $0, "Count", "Samples"; next }
     FNR == 1 { next }
              { if (line[$2,$3] == "")
                    line[$2,$3] = $0;
                count[$2,$3]++;
                if (idlist[$2,$3] != "")
                    idlist[$2,$3] = idlist[$2,$3] "," $1
                else
                    idlist[$2,$3] = $1
              }
     END      { for (idx in count) print line[idx], count[idx], idlist[idx]; }
    ' File*.txt

I had to make an assumption since the question doesn't state or illustrate how different values in 'col1', 'col2' and 'col3' should be handled, if at all. I've assumed that the first line with the given values in 'Ind' and 'Start' are representative enough. If there's a different rationale required, then a new question should be asked.

Note that the sequence of values in the output is not guaranteed.

Running the script above on the data in the question on Mac OS X 10.10.5 with the BSD awk yielded:

 Sample    Ind    Start    col1    col2    col3 Count   Samples
  ID1       1      1       f       g       f    3   ID1,ID2,ID3
  ID1       1      2       f       g       f    2   ID1,ID2
  ID1       1      3       f       g       f    2   ID1,ID3
  ID2       1      4       f       g       f    1   ID2
  ID1       1      5       f       g       f    2   ID1,ID2
  ID2       1      6       f       g       f    1   ID2
  ID3       1      7       f       g       f    1   ID3

Using GNU awk on the same machine and same data set yielded:

 Sample    Ind    Start    col1    col2    col3 Count   Samples
  ID1       1      5       f       g       f    2   ID1,ID2
  ID2       1      6       f       g       f    1   ID2
  ID3       1      7       f       g       f    1   ID3
  ID1       1      1       f       g       f    3   ID1,ID2,ID3
  ID1       1      2       f       g       f    2   ID1,ID2
  ID1       1      3       f       g       f    2   ID1,ID3
  ID2       1      4       f       g       f    1   ID2

Upvotes: 3

Related Questions