Reputation: 1655
I have multiple tab delim text files in the format shown below:
File1.txt
Sample Ind Start col1 col2 col3
ID1 1 1 f g f
ID1 1 2 f g f
ID1 1 3 f g f
ID1 1 5 f g f
File2.txt
Sample Ind Start col1 col2 col3
ID2 1 1 f g f
ID2 1 2 f g f
ID2 1 4 f g f
ID2 1 5 f g f
ID2 1 6 f g f
File3.txt
Sample Ind Start col1 col2 col3
ID3 1 1 f g f
ID3 1 3 f g f
ID3 1 7 f g f
I would like to count the number of times the columns 2 and 3 i.e. columns ' Ind' and 'Start' found in total number of files and print the count and Sample to a new column.
The output should look like:
Sample Ind Start col1 col2 col3 Count Samples
ID1 1 1 f g f 3 ID1,ID2,ID3
ID1 1 2 f g f 2 ID1,ID2
ID1 1 3 f g f 2 ID1,ID3
ID1 1 5 f g f 2 ID1,ID2
ID2 1 6 f g f 1 ID2
ID3 1 7 f g f 1 ID3
This is a toy example with 3 files, and in real scenario it should work with 'n' files in the directory. Could someone provide an awk solution to do this.
Upvotes: 1
Views: 150
Reputation: 755114
Superficially, this would do the job:
awk 'NR == 1 { OFS="\t"; print $0, "Count", "Samples"; next }
FNR == 1 { next }
{ if (line[$2,$3] == "")
line[$2,$3] = $0;
count[$2,$3]++;
if (idlist[$2,$3] != "")
idlist[$2,$3] = idlist[$2,$3] "," $1
else
idlist[$2,$3] = $1
}
END { for (idx in count) print line[idx], count[idx], idlist[idx]; }
' File*.txt
I had to make an assumption since the question doesn't state or illustrate how different values in 'col1', 'col2' and 'col3' should be handled, if at all. I've assumed that the first line with the given values in 'Ind' and 'Start' are representative enough. If there's a different rationale required, then a new question should be asked.
Note that the sequence of values in the output is not guaranteed.
Running the script above on the data in the question on Mac OS X 10.10.5 with the BSD awk
yielded:
Sample Ind Start col1 col2 col3 Count Samples
ID1 1 1 f g f 3 ID1,ID2,ID3
ID1 1 2 f g f 2 ID1,ID2
ID1 1 3 f g f 2 ID1,ID3
ID2 1 4 f g f 1 ID2
ID1 1 5 f g f 2 ID1,ID2
ID2 1 6 f g f 1 ID2
ID3 1 7 f g f 1 ID3
Using GNU awk
on the same machine and same data set yielded:
Sample Ind Start col1 col2 col3 Count Samples
ID1 1 5 f g f 2 ID1,ID2
ID2 1 6 f g f 1 ID2
ID3 1 7 f g f 1 ID3
ID1 1 1 f g f 3 ID1,ID2,ID3
ID1 1 2 f g f 2 ID1,ID2
ID1 1 3 f g f 2 ID1,ID3
ID2 1 4 f g f 1 ID2
Upvotes: 3