Reputation: 29
I have two files
1st file is like this:
www.example.com
www.domain.com
www.otherexample.com
www.other-domain.com
www.other-example.com
www.exa-ample.com
2nd file is like this (numbers after ;;; are between 0-10):
www.example.com;;;2
www.domain.com;;;5
www.other-domain;;;0
www.exa-ample.com;;;4
and I want to compare these two files and output the difference to a third file like this:
www.otherexample.com
www.other-example.com
Both files are large (over 500 MB).
Upvotes: 2
Views: 16607
Reputation: 42488
Use comm(1) to compare two sorted files and give the differences. Use grep(1) and sort(1) to get your files into an input format suitable for comparison with comm. Use process substitution in bash to tie it together:
comm -23 <(sort file1.txt) <(grep -o '^[^;]*' file2.txt | sort)
The -23 argument to comm says to ignore lines that are common to both files (-3) and lines unique to file 2 (-2). Depending on your exact specification, you can use -1, -2 or -3.
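For example, with two tiny sorted test files (hypothetical names, just to illustrate the flags):
printf 'alpha\nbravo\ncharlie\n' > a.txt
printf 'bravo\ndelta\n' > b.txt
comm -23 a.txt b.txt    # alpha, charlie (only in a.txt)
comm -13 a.txt b.txt    # delta (only in b.txt)
comm -12 a.txt b.txt    # bravo (in both)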
grep -o '^[^;]*' file2.txt just strips off everything after the first semicolon. You can use sed(1) for this, but if you are only extracting part of a line and not adding anything else, grep will often be faster.
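For reference, a roughly equivalent sed invocation would be:
sed 's/;.*//' file2.txt
Both keep the part of each file2.txt line that comes before the first semicolon.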
comm needs the input files to be sorted, so sort is used to do that; the output will also be sorted. sort uses locale-specific collation, so you may need to set LC_ALL=C depending on the exact collation you want.
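If you specifically want plain byte-order (C locale) collation, one way to write it, assuming the same file names as above, is:
LC_ALL=C comm -23 <(LC_ALL=C sort file1.txt) <(grep -o '^[^;]*' file2.txt | LC_ALL=C sort)
The point is simply that sort and comm must agree on the collation order.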
Note in your question you have www.other-domain in file 2, but www.other-domain.com in file 1. I have assumed that it is a typo in file 2 given the output.
This runs all the processes in parallel and streams the file data through them, so even if the files are large, it will not take up a lot of memory or any extra disk space to store temporary files.
Upvotes: 6
Reputation: 189679
If the input in file2 contains a subset of the contents of file1, you could just
sed 's/;.*//' file2 | fgrep -vxf - file1 >not-in-file2
The same general idea can be applied to diff or comm. However, comm requires sorted input; if that is not a problem (or if your data can be sorted to start with), just preprocess the data from file2:
sed 's/;.*//' file2.sorted | comm -13 - file1.sorted >cmp.out
The constraint that the input needs to be sorted is what allows comm to handle really large files, because it only needs to keep the latest data in memory at any one time. You could do the same with your own custom awk script.
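One common awk alternative looks something like this; note that unlike comm it does not need sorted input, but it holds all of file2's domains in memory instead of streaming:
awk -F';' 'NR==FNR { seen[$1]; next } !($0 in seen)' file2 file1 > not-in-file2
The first pass (NR==FNR, i.e. while reading file2) records the domain part of each line; the second pass prints the file1 lines whose domain was never recorded.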
Upvotes: 3
Reputation: 143112
You could use the diff command and direct the output to a third file. E.g.,
% diff data1.txt data2.txt > diffs
The diff man page shows a number of options that give you control over the comparison (processing and output).
The basic interactive operation, without specifying any options, assuming you have the data you show in your post in files data1.txt and data2.txt, yields:
% diff data1.txt data2.txt
1,6c1,4
< www.example.com
< www.domain.com
< www.otherexample.com
< www.other-domain.com
< www.other-example.com
< www.exa-ample.com
---
> www.example.com;;;2
> www.domain.com;;;5
> www.other-domain;;;0
> www.exa-ample.com;;;4
Upvotes: 0
Reputation: 8802
If a is the file with the first content and b is the file with the second content:
while read -r line; do grep -qxF "$line" b || echo "$line"; done < a
It prints every line of a that is not found in b.
Upvotes: 0
Reputation: 11690
You can use:
$ diff file1 file2 > file3
But it seems to me you want to disregard the ;;;<number> part, right?
Then you need to process file2 line by line, stripping the last part, and finally compare with diff.
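A rough sketch of that approach, using the file names from the question (file2.stripped is just a hypothetical name for the intermediate file):
sed 's/;;;.*//' file2 > file2.stripped    # drop everything from ;;; onwards
diff file1 file2.stripped > file3
Note that diff still prefixes its output lines with < and >, so you may want to filter file3 further to get a plain list of domains.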
Upvotes: 0