Martin Mocik

Reputation: 29

How to compare two big files and get results to third file?

I have two files

1st file is like this:

www.example.com
www.domain.com
www.otherexample.com
www.other-domain.com
www.other-example.com
www.exa-ample.com

2nd file is like this (numbers after ;;; are between 0-10):

www.example.com;;;2
www.domain.com;;;5
www.other-domain;;;0
www.exa-ample.com;;;4

I want to compare these two files and write the result to a third file, like this:

www.otherexample.com
www.other-example.com

Both files are large (over 500 MB).

Upvotes: 2

Views: 16607

Answers (5)

camh

Reputation: 42488

Use comm(1) to compare two sorted files and to give the differences. Use grep(1) and sort(1) to get your files into an input format suitable for comparison with comm. Use process substitution in bash to tie it together:

comm -23 <(sort file1.txt) <(grep -o '^[^;]*' file2.txt | sort)

The -23 argument to comm says to ignore lines that are common to both files (-3) and lines unique to file 2 (-2). Depending on your exact specification, you can use -1, -2 or -3.

grep -o '^[^;]*' file2.txt just strips off everything after the first semicolon. You can use sed(1) for this, but if you are only extracting part of a line and not adding anything else, grep will often be faster.
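As a quick illustration of that stripping step, here is the same cut done three ways on a single sample line. The variable names are made up for the example, and cut is an extra alternative not mentioned above.

```shell
# Sample line in the format of the second file.
line='www.example.com;;;2'

# Three ways to keep only the text before the first semicolon.
with_grep=$(printf '%s\n' "$line" | grep -o '^[^;]*')
with_sed=$(printf '%s\n' "$line" | sed 's/;.*//')
with_cut=$(printf '%s\n' "$line" | cut -d ';' -f 1)

# Each variable now holds the bare host name.
printf '%s\n' "$with_grep" "$with_sed" "$with_cut"
```

All three give the same result for lines like the ones in the question, so which one to use is mostly a matter of speed and taste.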

comm needs the input files to be sorted, so sort is used to do that. The output will be sorted. sort uses locale specific collation, so you may need to set LC_ALL=C depending on the exact collation you want.
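Here is a rough end-to-end sketch with the collation pinned to the C locale and the result written to a third file. It assumes the sample data from the question (with www.other-domain.com in the second file), made-up file names in a scratch directory, and it feeds comm from a pipe instead of process substitution so it also runs in shells without that feature.

```shell
# Scratch directory with hypothetical file names.
dir=$(mktemp -d)

printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com > "$dir/file1.txt"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' > "$dir/file2.txt"

# Use plain byte ordering so that sort and comm agree on the line order.
LC_ALL=C
export LC_ALL

# Sort the first file, then compare it with the stripped and sorted second
# file read from standard input ("-"). Only lines unique to file1 are kept.
sort "$dir/file1.txt" > "$dir/file1.sorted"
grep -o '^[^;]*' "$dir/file2.txt" | sort | comm -23 "$dir/file1.sorted" - > "$dir/file3.txt"

cat "$dir/file3.txt"
```

With this sample data file3.txt ends up holding www.other-example.com and www.otherexample.com, in that order, because '-' sorts before letters in the C locale.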

Note in your question you have www.other-domain in file 2, but www.other-domain.com in file 1. I have assumed that it is a typo in file 2 given the output.

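This runs all the processes in parallel and streams the data through them, so grep and comm need very little memory however large the files are. sort is the exception: it has to read all of its input before it can write anything, and GNU sort spills to temporary files (under $TMPDIR or /tmp) when the input does not fit in its memory buffer.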

Upvotes: 6

tripleee

Reputation: 189679

If the input in file2 contains a subset of the contents of file1, you could just

sed 's/;.*//' file2 | grep -Fvxf - file1 >not-in-file2

The same general idea can be applied to diff or comm. comm requires sorted input, but if that is not a problem (or if your data is sorted to start with), just preprocess the data from file2:

sed 's/;.*//' file2.sorted | comm -13 - file1.sorted >cmp.out

The constraint that input needs to be sorted is what allows comm to handle really large files, because it just needs to keep the latest data in memory at any one time. You could do the same with your own custom awk script.
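A custom awk script along those lines might look something like the sketch below. It is only an illustration: the file names, the keys variable and the advance function are invented for the example, the sample data is the one from the question (with www.other-domain.com in file2), and both inputs are sorted in the C locale after the ;;;N suffix has been stripped.

```shell
# Scratch directory with hypothetical file names.
dir=$(mktemp -d)
LC_ALL=C
export LC_ALL

printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com |
    sort > "$dir/file1.sorted"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' |
    sed 's/;.*//' | sort > "$dir/file2.keys"

# Walk both sorted files in step, holding one line of each in memory.
awk -v keys="$dir/file2.keys" '
    # Read the next key from the second file; returns 0 at end of file.
    function advance() { return (getline key < keys) > 0 }
    BEGIN { more = advance() }
    {
        # Skip keys that sort before the current line of the first file.
        while (more && (key "") < ($0 "")) more = advance()
        # No key left, or the key differs: the line is only in the first file.
        if (!more || (key "") != ($0 "")) print
    }
' "$dir/file1.sorted" > "$dir/not-in-file2"

cat "$dir/not-in-file2"
```

The `(key "")` concatenations just force string comparison. In practice comm does the same job with less typing; a script like this is mostly useful when the comparison rule needs customizing.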

Upvotes: 3

Levon

Reputation: 143112

You could use the diff command and redirect the output to a third file, e.g.:

% diff data1.txt data2.txt > diffs

The diff man page shows a number of options that give you control over the comparison (processing and output).

The basic invocation without any options, assuming the data you show in your post is in files data1.txt and data2.txt, yields:

% diff data1.txt data2.txt 

1,6d0
< www.example.com
< www.domain.com
< www.otherexample.com
< www.other-domain.com
< www.other-example.com
< www.exa-ample.com

Upvotes: 0

Alessandro Pezzato

Reputation: 8802

If a is the file with the first content and b is the file with the second content:

while IFS= read -r line; do grep -qF -- "$line" b || printf '%s\n' "$line"; done < a

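It prints each line of a that is not found in b. Because grep rescans b once for every line of a, expect this to be very slow on files of this size.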

Upvotes: 0

Roman Newaza

Reputation: 11690

You can use:

$ diff file1 file2 > file3

But it seems to me you want to disregard the ;;;0 part, right? Then you need to strip that suffix from each line first and then compare the result with diff.
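The strip-then-diff idea could look roughly like this. It is only a sketch: the file names under a scratch directory are made up, file2 uses www.other-domain.com as the question's expected output implies, both inputs are sorted first because diff is sensitive to line order, and the final sed keeps only the lines that diff marks with "<" (present in the first input only).

```shell
# Scratch directory with hypothetical file names.
dir=$(mktemp -d)
LC_ALL=C
export LC_ALL

printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com > "$dir/file1"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' > "$dir/file2"

sort "$dir/file1" > "$dir/file1.sorted"

# Strip the ";;;N" suffix, sort, and diff against the sorted first file.
# diff reads the second input from standard input ("-").
sed 's/;;;.*//' "$dir/file2" | sort |
    diff "$dir/file1.sorted" - |
    sed -n 's/^< //p' > "$dir/file3"

cat "$dir/file3"
```

For this sample data file3 contains www.other-example.com and www.otherexample.com.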

Upvotes: 0
