Reputation: 21
I need to find common lines across multiple files: more than 100 files with millions of lines each. This is similar to: Shell: Find Matching Lines Across Many Files.
However, I would like to find not only the lines shared by all files but also the lines found in all files except one, all files except two, and so on. I would like to express this as percentages: for example, which entries show up in 90% of the files, 80%, 70%, and so on. As an example:
File1
lineA
lineB
lineC
File2
lineB
lineC
lineD
File3
lineC
lineE
lineF
Hypothetical output for the sake of demonstration:
<lineC> is found in 3 out of 3 files (100.00%)
<lineB> is found in 2 out of 3 files (66.67%)
<lineF> is found in 1 out of 3 files (33.33%)
Does anyone know how to do it?
Thank you very much!
Upvotes: 0
Reputation: 247210
With GNU awk, which supports true multidimensional arrays (arrays of arrays):
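gawk '
    # Number of file arguments given on the command line.
    BEGIN { nfiles = ARGC - 1 }
    # Record every file in which this line appears (duplicates within a file collapse).
    { lines[$0][FILENAME] = 1 }
    END {
        for (line in lines) {
            # Number of distinct files that contain this line.
            n = length(lines[line])
            printf "<%s> is found in %d of %d files (%.2f%%)\n", line, n, nfiles, 100*n/nfiles
        }
    }
' file{1,2,3}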
<lineA> is found in 1 of 3 files (33.33%)
<lineB> is found in 2 of 3 files (66.67%)
<lineC> is found in 3 of 3 files (100.00%)
<lineD> is found in 1 of 3 files (33.33%)
<lineE> is found in 1 of 3 files (33.33%)
<lineF> is found in 1 of 3 files (33.33%)
The order of output is indeterminate, because awk's for (line in lines) loop visits array keys in an unspecified order.
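Since the question asks for lines above a given percentage, and the order above is indeterminate, here is a minimal sketch of a variant that filters by a minimum percentage and sorts the most widely shared lines first. It is not the gawk approach above: it swaps in plain POSIX awk with a per-file "seen" array that is cleared at the start of each file, so only the per-line counts stay in memory across files. The function name count_shared and the tab-separated output (percentage, file count, line) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count_shared MIN_PCT FILE...
# Prints "percentage<TAB>count<TAB>line" for every line that occurs in
# at least MIN_PCT percent of the files, most widely shared first.
count_shared() {
    min_pct=$1
    shift
    awk -v min="$min_pct" -v nfiles="$#" '
        # New file: forget the lines seen in the previous one.
        FNR == 1 { split("", seen) }
        # Count each distinct line at most once per file.
        !seen[$0]++ { count[$0]++ }
        END {
            for (line in count) {
                pct = 100 * count[line] / nfiles
                if (pct >= min)
                    printf "%.2f\t%d\t%s\n", pct, count[line], line
            }
        }
    ' "$@" | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k3,3
}

# Example: lines present in at least 90% of the files
# count_shared 90 file1 file2 file3
```

With a threshold of 0 this lists every line, and with 100 only the lines common to all files. Memory use still grows with the number of distinct lines across all files, so for very large inputs this remains a sketch rather than a tuned solution.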
Upvotes: 2