alvas

Reputation: 122022

How to find duplicate lines across 2 different files? Unix

From the Unix terminal, we can use diff file1 file2 to find the difference between two files. Is there a similar command to show the similarity between two files? (Many pipes allowed if necessary.)

Each line of each file contains a sentence string; the files are sorted and duplicate lines removed with sort file1 | uniq.

file1: http://pastebin.com/taRcegVn

file2: http://pastebin.com/2fXeMrHQ

The output should contain the lines that appear in both files.

output: http://pastebin.com/FnjXFshs

I am able to do it in Python as follows, but I think it's a little too much to put into the terminal:

x = set(i.strip() for i in open('wn-rb.dic'))
y = set(i.strip() for i in open('wn-s.dic'))
z = x.intersection(y)                    # lines common to both files
outfile = open('reverse-diff.out', 'w')  # must be opened for writing
for i in z:
    print >>outfile, i                   # Python 2 print-to-file syntax
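For reference, since both files are already sorted with duplicates removed, the same intersection could in principle be computed with the standard comm utility (file names here are the ones from the Python snippet above):

comm -12 wn-rb.dic wn-s.dic > reverse-diff.out   # -1 and -2 suppress lines unique to each file, leaving only common lines

comm requires sorted input, which the sort | uniq preprocessing already guarantees.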

Upvotes: 19

Views: 41207

Answers (2)

user1149862

Reputation:

As @tjameson mentioned, this may already be solved in another thread. I'd just like to post another solution: sort file1 file2 | awk 'dup[$0]++ == 1'

  1. Refer to an awk guide for the basics: when the pattern of a line evaluates to true, that line is printed.

  2. dup[$0] is a hash table in which each key is a line of the input. Each value starts at 0 and is incremented every time its line occurs, so on the second occurrence dup[$0]++ evaluates to 1 (the value before the increment), making dup[$0]++ == 1 true, and the line is printed (see the sketch after this list).
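As a quick illustration, suppose two hypothetical files: a.txt containing the lines apple and banana, and b.txt containing banana and cherry. A sketch of the evaluation:

sort a.txt b.txt | awk 'dup[$0]++ == 1'
# sorted stream: apple, banana, banana, cherry
# dup["apple"]++  -> 0 (false, skipped)
# dup["banana"]++ -> 0 (false), then 1 on the second "banana" (true, printed)
# dup["cherry"]++ -> 0 (false, skipped)
# output: banana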

Note that this only works when there are no duplicates within either file, as was specified in the question.

Upvotes: 18

user35147863

Reputation: 2605

If you want to get a list of repeated lines without resorting to awk, you can use the -d flag of uniq:

sort file1 file2 | uniq -d
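For instance, with the same hypothetical a.txt and b.txt as above:

sort a.txt b.txt | uniq -d
# output: banana

uniq -d prints one copy of each line that appears more than once in its sorted input, so this also assumes neither file contains internal duplicates.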

Upvotes: 36
