dodecafonico

Reputation: 305

Unix: how to count instances of each line across multiple files

I am faced with hundreds or even thousands of files with the same name, each in a subfolder with a different name. To keep the question simple, I will use an example with only 3 subfolders:

In subfolder1/logfile.txt I have this content:

zebra
house
apple
car

In subfolder2/logfile.txt I have this content:

apple
zebra
orange
sun

In subfolder3/logfile.txt I have this content:

sun
love
zebra
hat

I would like to get a single output that counts, across ALL the files named logfile.txt in ALL subdirectories (always only one level deep), the occurrences of each line, and returns each unique line with its number of occurrences.

So the output I want for this example would look like this:

3 zebra
2 apple
2 sun
1 car
1 hat
1 house
1 love
1 orange

Could this be done in a single step/command line?

Would I first need to merge the contents of all the files into one and then apply a command that counts the unique lines and outputs them in the way I described?

Or would I need to write a Python script? (I could do that, but if a simple command gets me there, why reinvent the wheel?)

In any case, how would I accomplish this?

Thank you very much!

EDIT: I have an extra requirement; hopefully it can all be put into a single command. In the returned output I would like to get, as a second column, all the subfolders where that line occurred. I am only interested in this for lines with 5 occurrences or fewer. So in the example the first line of the output would be something like:

3 subfolder1,subfolder2,subfolder3 zebra

2 subfolder1,subfolder2 apple

and so on. For lines with more than 5 occurrences (there are none in this example) I would want nothing at all in that second column, or even better, the phrase many occurrences.

Many thanks :-)

Upvotes: 1

Views: 594

Answers (1)

fedorqui

Reputation: 289555

You can, for example, use find as follows:

$ find /your/path -name "logfile.txt" -exec cat {} \; | sort | uniq -c | sort -rn
      3 zebra
      2 sun
      2 apple
      1 orange
      1 love
      1 house
      1 hat
      1 car

This looks for all the logfile.txt files within the /your/path tree and cats them together. It then sorts the combined output and counts how many times each line appears. Finally, it sorts again numerically so that the most frequent lines come first.
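
Since the files are always only one level deep, a plain shell glob would also do the job, without find. This is just a minimal sketch, assuming you run it from the parent directory and every subfolder actually contains a logfile.txt:

$ sort */logfile.txt | uniq -c | sort -rn

sort accepts several files at once, so the cat step is not needed here; the rest of the pipeline is the same.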


Update

For your extended requirement, here is a hint:

$ find . -name "logfile.txt" -exec grep -H "" {} \; | awk -F: '{a[$2]++; b[$2]=b[$2] OFS $1} END {for (i in a) print a[i], i, b[i]}' | sort -nr
3 zebra  ./t2/logfile.txt ./t1/logfile.txt ./t3/logfile.txt
2 sun  ./t2/logfile.txt ./t3/logfile.txt
2 apple  ./t2/logfile.txt ./t1/logfile.txt
1 orange  ./t2/logfile.txt
1 love  ./t3/logfile.txt
1 house  ./t1/logfile.txt
1 hat  ./t3/logfile.txt
1 car  ./t1/logfile.txt

find gets the files as before, and grep -H "" {} \; prints every line of each file; thanks to the -H trick, each line is prefixed with the name of the file it came from:

$ grep -H "" t2/a
t2/a:apple
t2/a:zebra
t2/a:orange
t2/a:sun

The awk command counts how many times each line appears across the files and also records which files it appears in. It then prints the results in the END block. Finally, sort -nr sorts the output numerically, most frequent first.
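
If you also want the second column formatted as in your edit (just the subfolder names, comma-separated, and the phrase many occurrences when a line shows up more than 5 times), the awk part can be extended along these lines. This is only a sketch: it assumes the layout above (./subfolder/logfile.txt, run from the parent directory) and that a given line appears at most once per file:

$ find . -name "logfile.txt" -exec grep -H "" {} \; |
  awk -F: '{
      split($1, p, "/"); dir = p[2]                    # ./subfolder1/logfile.txt -> subfolder1
      count[$2]++                                      # how many files contain this line
      dirs[$2] = (dirs[$2] ? dirs[$2] "," : "") dir    # comma-separated list of subfolders
  } END {
      for (w in count)
          print count[w], (count[w] <= 5 ? dirs[w] : "many occurrences"), w
  }' | sort -nr

With the example data this prints lines such as 3 subfolder1,subfolder2,subfolder3 zebra, although the order of the subfolders within the list depends on the order in which find visits them.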

Upvotes: 4
