Reputation: 51593
I need to work with large files and must find the differences between two of them. I don't need the differing lines themselves, only the number of differences.
To find the number of differing rows I came up with:
diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l
And it works, but is there a better way to do it?
And how can I count the exact number of differences (with standard tools like bash, diff, awk, sed, or some old version of perl)?
Upvotes: 58
Views: 78078
Reputation: 802
If you want to count the number of places where the files differ (each @@ range counts once, however many lines it spans), use this:
diff -U 0 file1 file2 | grep ^@ | wc -l
Doesn't John's answer double count the different lines?
Upvotes: 54
Reputation: 1664
I would've loved to edit @Josh or @John's answer, but the edit queue is full, so here goes:
diff -U 0 file1 file2 | tail -n +3 | grep -c '^@'
Why?
diff -U 0 file1 file2
outputs something like:
--- file1 (+ timestamp)
+++ file2 (+ timestamp)
@@ range information for first difference @@
+ some
+ added
+ lines
@@ range info for second difference @@
- some
- removed
- lines
@@ range info for edit @@
- I changed
- this
+ into
+ these new
+ lines
More information about range info in this SO answer
So:
tail -n +3
prints the output starting from the 3rd line. In other words, this removes the 2 file information lines.
grep -c '^@'
counts the lines starting with '@', that is, the modified ranges.
The output is therefore the count of the differences, with "difference" taken as a range that underwent modification.
For instance, with the above example diff, the output would be:
3
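For readers who would rather script this, here is a minimal Python sketch of the same idea: counting contiguous changed ranges. It relies on the standard difflib module, whose matching algorithm is not the one diff uses, so the count may differ from diff's on some inputs. The function name count_changed_ranges and the sample lists are made up for illustration.

```python
import difflib

def count_changed_ranges(lines_a, lines_b):
    """Count contiguous changed ranges between two lists of lines."""
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    # Every opcode other than 'equal' is one contiguous insert, delete or replace.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")

# Hypothetical sample data: one line replaced, one line appended.
old = ["a", "b", "c", "d", "e"]
new = ["a", "x", "c", "d", "e", "f"]
print(count_changed_ranges(old, new))
```

For real files, the lists could come from open(path).readlines(), at the cost of holding both files in memory.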
But by "difference" I mean a count of the modified lines!
Since, as pointed out in the other answers, a modification of a single line shows up twice, both as a deletion (-) and as an addition (+), it is then best to count additions and deletions separately.
Here you go:
diff -U 0 file1 file2 | tail -n +3 | perl -ne 'if (/^\+/) { $add +=1 }; if (/^-/) { $del += 1 }; END { if (!$add) { $add=0 }; if (!$del) { $del=0 }; print "+$add -$del\n"}'
What does the perl "one-liner" do?
# for each line (implicit with the -n flag for perl):
if (/^\+/) { $add += 1 };      # increase the count of added lines (those starting with +)
if (/^-/)  { $del += 1 };      # increase the count of deleted lines (those starting with -)
END {                          # at the end of the processing
    if (!$add) { $add = 0 };   # set count to 0 if no added line
    if (!$del) { $del = 0 };   # set count to 0 if no deleted line
    print "+$add -$del\n"      # print the count of added lines and the count of deleted lines
}
Sample output for the above diff example:
+6 -5
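The same tally can be sketched in Python with the standard difflib module, for cases where perl is not at hand. This only approximates the pipeline: difflib uses its own matching algorithm, so on some inputs its counts may differ from diff's. The function name count_added_removed and the sample lists are made up for illustration.

```python
import difflib
from itertools import islice

def count_added_removed(lines_a, lines_b):
    """Return (added, removed) line counts from a zero-context unified diff."""
    diff = difflib.unified_diff(lines_a, lines_b, n=0, lineterm="")
    added = removed = 0
    # Skip the two header lines (--- and +++), like tail -n +3 does.
    for line in islice(diff, 2, None):
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed

# Hypothetical sample data: "b" changed into "x", and "d" appended.
old = ["a", "b", "c"]
new = ["a", "x", "c", "d"]
print("+%d -%d" % count_added_removed(old, new))
```

Skipping the headers by position rather than by prefix matters: a removed line whose content begins with "--" would otherwise be mistaken for a header.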
Upvotes: 1
Reputation: 2070
If you're dealing with files with analogous content that should match up line-for-line (like CSV files describing similar things), and you want to find, e.g., 2 differences in the following files:
File a:    File b:
min,max    min,max
1,5        2,5
3,4        3,4
-2,10      -1,1
you could implement it in Python like this:
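from itertools import zip_longest

file1, file2 = "a.csv", "b.csv"  # placeholder paths for the two files to compare
different_lines = 0
with open(file1) as a, open(file2) as b:
    # zip_longest pads the shorter file with None, so surplus lines count as differences
    for line, other_line in zip_longest(a, b):
        if line != other_line:
            different_lines += 1
print(different_lines)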
Upvotes: 0
Reputation: 868
Here is a way to count any kind of difference between two files, with a regex specifying what counts as a unit of difference. Here the regex is . (any character except newline):
git diff --patience --word-diff=porcelain --word-diff-regex=. file1 file2 | pcre2grep -M "^@[\s\S]*" | pcre2grep -M --file-offsets "(^-.*\n)(^\+.*\n)?|(^\+.*\n)" | wc -l
An excerpt from man git-diff:
--patience
    Generate a diff using the "patience diff" algorithm.
--word-diff[=<mode>]
    Show a word diff, using the <mode> to delimit changed words. By default, words are delimited by whitespace; see --word-diff-regex below.
    porcelain
        Use a special line-based format intended for script consumption. Added/removed/unchanged runs are printed in the usual unified diff format, starting with a +/-/` ` character at the beginning of the line and extending to the end of the line. Newlines in the input are represented by a tilde ~ on a line of its own.
--word-diff-regex=<regex>
    Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word. Also implies --word-diff unless it was already enabled.
    Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!) for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
    For example, --word-diff-regex=. will treat each character as a word and, correspondingly, show differences character by character.
pcre2grep is part of the pcre2-utils package on Ubuntu 20.04.
Upvotes: 0
Reputation: 4841
I believe the correct solution is in this answer, that is:
$ diff -y --suppress-common-lines a b | grep '^' | wc -l
1
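A rough Python sketch of what this row count measures, assuming side-by-side output pairs old and new lines of a changed block on shared rows, so a modified line counts once rather than twice. It uses the standard difflib module, which may align lines differently from diff on some inputs. The function name count_differing_rows and the sample lists are made up for illustration.

```python
import difflib

def count_differing_rows(lines_a, lines_b):
    """Count rows a side-by-side listing would show once common lines are dropped."""
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    rows = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced block pairs old and new lines on the same rows,
            # so it occupies as many rows as its longer side.
            rows += max(i2 - i1, j2 - j1)
    return rows

# Hypothetical sample data: "b" changed into "x", and "d" appended.
old = ["a", "b", "c"]
new = ["a", "x", "c", "d"]
print(count_differing_rows(old, new))
```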
Upvotes: 5
Reputation: 187
Since every output line that differs starts with a < or > character, I would suggest this:
diff file1 file2 | grep '^[<>]' | wc -l
By using only < or only > in the pattern (for example grep '^<'), you can count the differing lines of just one of the files.
Upvotes: 5
Reputation: 4802
If using Linux/Unix, what about comm -23 file1 file2 to print lines in file1 that aren't in file2, comm -23 file1 file2 | wc -l to count them, and similarly comm -13 file1 file2 for lines that are only in file2? Note that comm expects both files to be sorted.
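If the files are not sorted, or comm is not available, counting the lines unique to each file can be sketched in Python with collections.Counter. Like comm, this treats repeated lines as a multiset, and it ignores line order entirely. The function name count_unique_lines and the sample lists are made up for illustration.

```python
from collections import Counter

def count_unique_lines(lines_a, lines_b):
    """Return (only_in_a, only_in_b), counting repeated lines individually."""
    count_a, count_b = Counter(lines_a), Counter(lines_b)
    # Counter subtraction keeps only positive counts, i.e. the surplus on each side.
    only_a = sum((count_a - count_b).values())
    only_b = sum((count_b - count_a).values())
    return only_a, only_b

# Hypothetical sample data: "a" and one extra "b" exist only in the first list,
# "d" only in the second.
first = ["a", "b", "b", "c"]
second = ["b", "c", "d"]
print(count_unique_lines(first, second))
```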
Upvotes: 5
Reputation: 361585
diff -U 0 file1 file2 | grep -v ^@ | wc -l
That, minus 2 for the two file names at the top of the diff listing. Unified format is probably a bit faster than side-by-side format.
Upvotes: 52