Reputation: 1394
I would like to diff two very large files (multi-GB) using Linux command-line tools and see the line numbers of the differences. The order of the data matters.
I am running on a Linux machine and the standard diff tool gives me the "memory exhausted" error. -H had no effect.
In my application, I only need to stream the diff results. That is, I just want to look at the first few differences; I don't need to inspect the entire file. If there are differences, a quick glance will tell me what is wrong.
'comm' seems well suited to this, but it does not display line numbers of the differences.
In general, my multi-GB files only have a few hundred lines that are different, the rest of the file is the same.
Is there a way to get comm to dump the line number? Or a way to make diff run without loading the entire file into memory? (like cutting the input files into 1k blocks, without actually creating a million 1k-files in my filesystem and cluttering everything up)?
Upvotes: 1
Views: 1185
Reputation: 23
Which versions of diff have you tried? GNU diff has a "--speed-large-files" option, which may help.
The comm tool assumes the lines are sorted.
Upvotes: 0
Reputation: 3801
I won't use comm, but since you said WHAT you need, in addition to HOW you thought you should do it, I'll focus on the "WHAT you need" instead:
An interesting way would be to use paste and awk: paste can show 2 files "side by side" using a separator. If you use \n as the separator, it displays the 2 files interleaved: line 1 of each, followed by line 2 of each, etc.
So the script could be as simple as this (once you know that both files have the same number of lines):
paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
(Interestingly, this solution can easily be extended to diff N files in a single read, whatever the sizes of the N files are. Just add a check that all of them have the same number of lines before doing the comparison steps; otherwise, once the shorter files run out, "paste" will show only lines from the longer files.)
Here is a (short) example, to show how it works:
$ cat > /tmp/file1
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ cat > /tmp/file2
A
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ paste -d '\n' /tmp/file1 /tmp/file2
A
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
E
$ paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
line 2 :
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
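The N-file extension mentioned above could look roughly like the following. This is only a sketch: diff_n is a hypothetical function name, and it assumes every file has the same number of lines. It relies on paste reusing the single \n delimiter between all of its inputs, so each group of N consecutive output lines holds the same line number from each file.

```shell
#!/bin/sh
# Sketch only: diff_n is a hypothetical helper, not part of the original answer.
# Assumes all files given as arguments have the same number of lines.
diff_n() {
    # $# is the number of files; awk uses it as the group size.
    paste -d '\n' "$@" | awk -v n="$#" '
    {
        idx = (NR - 1) % n          # position of this line within its group
        line[idx] = $0
        if (idx > 0 && $0 != line[0]) differ = 1
        if (idx == n - 1) {         # last line of the group: report and reset
            if (differ) {
                print "line", NR / n, ": "
                for (i = 0; i < n; i++) print line[i]
            }
            differ = 0
        }
    }'
}
```

For example, diff_n /tmp/file1 /tmp/file2 /tmp/file3 would print each line number where at least one file disagrees with the first, followed by that line from every file.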
If the files might not have the same number of lines, first check the line counts by comparing $(wc -l < /tmp/file1) and $(wc -l < /tmp/file2) (the redirection keeps the file name out of wc's output, so the two values can be compared directly), and only run paste ... | awk if they are equal, so that "paste" always emits one line from each file. (Of course, in that case there will be one (fast!) full read of each file.)
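A minimal sketch of that check, wrapped in a hypothetical shell function named compare_same_length (the name and the error message are mine, not part of the answer above):

```shell
#!/bin/sh
# Sketch only: compare_same_length is a hypothetical helper name.
compare_same_length() {
    f1=$1
    f2=$2
    # Reading from stdin makes wc print only the count, without the file name.
    # The $(( )) wrapper strips any padding some wc implementations add.
    n1=$(( $(wc -l < "$f1") ))
    n2=$(( $(wc -l < "$f2") ))
    if [ "$n1" -ne "$n2" ]; then
        echo "line counts differ: $n1 vs $n2" >&2
        return 1
    fi
    # Same paste/awk comparison as above, run only when the counts match.
    paste -d '\n' "$f1" "$f2" | awk '
        NR % 2    { linefirstfile = $0 }
        !(NR % 2) { if ($0 != linefirstfile) {
                        print "line", NR / 2, ": "
                        print linefirstfile
                        print $0 } }'
}
```

Calling compare_same_length /tmp/file1 /tmp/file2 then either prints the differing lines or reports the mismatch in line counts on stderr and returns a non-zero status.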
You can easily adjust it to display exactly what you need. And you could quit after the Nth difference (either automatically, with a counter in the awk script, or by pressing CTRL-C once you have seen enough).
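One way the counter variant might look, as a sketch: first_diffs is a hypothetical function name, and the default limit of 10 is an arbitrary choice. Once awk exits, paste is stopped by the closed pipe, so the rest of the files is not read.

```shell
#!/bin/sh
# Sketch only: first_diffs is a hypothetical helper name.
# Usage: first_diffs FILE1 FILE2 [MAX]   (MAX defaults to 10, an arbitrary value)
first_diffs() {
    paste -d '\n' "$1" "$2" | awk -v max="${3:-10}" '
        NR % 2 { linefirstfile = $0 }
        !(NR % 2) && $0 != linefirstfile {
            print "line", NR / 2, ": "
            print linefirstfile
            print $0
            # Stop reading as soon as max differences have been printed.
            if (++count >= max) exit
        }'
}
```

For instance, first_diffs /tmp/file1 /tmp/file2 5 would stop after the fifth differing line.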
Upvotes: 1