Reputation: 305
I am faced with hundreds or even thousands of files of the same name, each in a subfolder with a different name. To keep the question simple, I will use an example with only 3 different subfolders:
In subfolder1/logfile.txt I would have this content:
zebra
house
apple
car
In subfolder2/logfile.txt I would have this content:
apple
zebra
orange
sun
In subfolder3/logfile.txt I would have this content:
sun
love
zebra
hat
I would like to get a single output file that counts, across ALL files named logfile.txt in ALL subdirectories (always exactly one level deep), the occurrences of each row, and returns each unique row with its number of occurrences.
So the output I would want to get for this example would end up like this:
3 zebra
2 apple
2 sun
1 car
1 hat
1 house
1 love
1 orange
Could this be done in a single step/command line?
Would I first need to merge the contents of all the files into one and then apply a command that counts the unique rows and outputs them the way I described?
Or would I need to write a Python script? (I could do that, but if a simple command does the job, why reinvent the wheel?)
In any case, how would I accomplish this?
Thank you very much!
EDIT: I have an extra requirement; hopefully it can all still be put into a single command. In the output I would also like to get, as a second column, all the subfolders where that line occurred. I am only interested in knowing this for lines with 5 occurrences or less. So in this example, the first lines of the output would be something like:
3 subfolder1,subfolder2,subfolder3 zebra
2 subfolder1,subfolder2 apple
and so on. For lines with more than 5 occurrences (there are none in this example) I would want nothing at all in that second column, or even better, the phrase many occurrences.
Many thanks :-)
Upvotes: 1
Views: 594
Reputation: 289555
You can, for example, use find as follows:
$ find /your/path -name "logfile.txt" -exec cat {} \; | sort | uniq -c | sort -rn
3 zebra
2 sun
2 apple
1 orange
1 love
1 house
1 hat
1 car
This looks for all the logfile.txt files within the /your/path structure and cats them. Then it sorts the output and counts how many times each item appears. Finally, it sorts again so the biggest occurrence count is at the top.
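Since the files are always exactly one level deep, a plain shell glob would also do the job without find (a minimal alternative, assuming no other logfile.txt files at depths you care about):
$ sort */logfile.txt | uniq -c | sort -rn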
For your extended requirement, here is a hint:
$ find . -name "logfile.txt" -exec grep -H "" {} \; | awk -F: '{a[$2]++; b[$2]=b[$2] OFS $1} END {for (i in a) print a[i], i, b[i]}' | sort -nr
3 zebra ./t2/logfile.txt ./t1/logfile.txt ./t3/logfile.txt
2 sun ./t2/logfile.txt ./t3/logfile.txt
2 apple ./t2/logfile.txt ./t1/logfile.txt
1 orange ./t2/logfile.txt
1 love ./t3/logfile.txt
1 house ./t1/logfile.txt
1 hat ./t3/logfile.txt
1 car ./t1/logfile.txt
find gets the files like before, and then grep -H "" {} \; prints all the lines of the files; the trick is that with -H we get the name of the file on the left of each line:
$ grep -H "" t2/a
t2/a:apple
t2/a:zebra
t2/a:orange
t2/a:sun
The awk command stores how many times every word appears in the texts and also which files it appears in. Then it prints the results in the END block. Finally, sort -nr sorts the output.
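Building on that hint, here is a minimal sketch of the full extra requirement: it keeps only the subfolder name, lists the folders only for lines with 5 occurrences or less, and prints many occurrences otherwise. It assumes GNU find (for -maxdepth, since the files are always one level deep) and that each line appears at most once per file:
$ find . -maxdepth 2 -name "logfile.txt" -exec grep -H "" {} \; |
  awk -F: '{split($1, p, "/"); d = p[2]          # p[2] is the subfolder in ./subfolder/logfile.txt
            a[$2]++                              # count occurrences of each line
            b[$2] = b[$2] (b[$2] ? "," : "") d}  # collect subfolders, comma-separated
           END {for (i in a)
                  print a[i], (a[i] <= 5 ? b[i] : "many occurrences"), i}' |
  sort -nr
For the example above this would print lines like 3 t1,t2,t3 zebra.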
Upvotes: 4