drbunsen

Reputation: 10689

How to find the Set - Subset of two files from the command line?

I have two files with sorted lines. One file (B) is a subset of the other file (A). I would like to find all lines in A that ARE NOT in B. Ideally, I would like to create a file (C) that contains these lines. Is this possible in Unix? I'm looking for a one-line command to do this instead of writing a script. I looked at the join and diff commands, but I could not find an option that does this. Thanks for the help.

Upvotes: 6

Views: 2851

Answers (5)

johnshen64

Reputation: 3884

This suppresses the lines common to both files. Because B is a subset of A, what remains is the lines found only in A:

comm -3 a b
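If B could ever contain lines that are not in A, `comm -3` would also print those, indented in a second column. A minimal sketch that prints only the lines unique to the first file and writes them to a third file is below; the sample data and the `workdir` variable are made up for illustration, and both inputs are assumed to be sorted already.

```shell
# Hypothetical sample data; comm expects both files to be sorted.
workdir=$(mktemp -d)
printf '%s\n' a c d g h > "$workdir/a"
printf '%s\n' a d g > "$workdir/b"

# -2 drops lines unique to b, -3 drops lines common to both,
# so only the lines unique to a are written to c.
comm -23 "$workdir/a" "$workdir/b" > "$workdir/c"
cat "$workdir/c"
```

With this sample data, `c` should end up containing the lines `c` and `h`.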

Upvotes: 13

Dennis Williamson

Reputation: 360065

This join command will do what you're asking:

join -v 1 fileA fileB > fileC

Demonstration:

$ cat fileA
a
c
d
g
h
t
u
v
z
$ cat fileB
a
d
g
t
u
z
$ join -v 1 fileA fileB
c
h
v

This assumes sorted files, as you stated in your question. For unsorted files, in a shell that supports process substitution (such as Bash):

join -v 1 <(sort fileA) <(sort fileB)

Upvotes: 1

Debaditya

Reputation: 2497

Awk Solution

Input files

a

aaa
bbb
ccc

b

ccc
ddd
eel

Code

awk ' NR==FNR { A[$0]=1; next; }
{ if ($0 in A) { A[$0]=0; } }
END { for (k in A) { if (A[k]==1) { print k; } } } ' a b > c

c (Output file)

bbb
aaa
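The order in which awk's `for (k in A)` visits keys is unspecified, which is why the output above is not in input order. If the original order matters, one possible variant reads `b` first and then prints each line of `a` that was not seen. This is a sketch, not part of the answer above; the array name `seen` and the `workdir` variable are illustrative, and it assumes `b` is not empty (with an empty first file, `NR==FNR` would also be true for `a`).

```shell
# Same sample data as above, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' aaa bbb ccc > "$workdir/a"
printf '%s\n' ccc ddd eel > "$workdir/b"

# First pass (NR==FNR): remember every line of b.
# Second pass: print each line of a that was not remembered,
# keeping a's original order.
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' "$workdir/b" "$workdir/a" > "$workdir/c"
cat "$workdir/c"
```

Note that the file arguments are swapped compared with the code above: the file to subtract comes first.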

Upvotes: 0

derobert

Reputation: 51147

You can do this with diff as well. Unlike @johlo's grep answer, diff cares about order, and unlike @johnshen64's comm answer, it works on unsorted files:

$ cat a
a
b
c
d
e
$ cat b
a
b
f
d
e
$ diff -dbU0 a b
--- a   2012-05-18 16:02:30.603386016 -0400
+++ b   2012-05-18 16:02:45.547817122 -0400
@@ -3 +3 @@
-c
+f

So you can use a pipeline to get just the omitted lines, taking order into account:

$ diff -dbU0 a b | tail -n +4 | grep ^- | cut -c2-
c

Upvotes: 3

johlo

Reputation: 5500

How about this:

grep -v -f B A > C
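One caveat: by default grep treats each line of B as a regular expression and matches it anywhere within a line of A, so a short line in B can remove longer lines from A. A hedged sketch of a stricter variant uses `-F` (fixed strings) and `-x` (whole-line match); the sample data and the `workdir` variable are made up for illustration.

```shell
# Hypothetical sample data chosen to expose substring and regex matches.
workdir=$(mktemp -d)
printf '%s\n' 'a.c' 'ab' 'abc' 'abcd' > "$workdir/A"
printf '%s\n' 'a.c' 'ab' > "$workdir/B"

# -F: treat the lines of B as literal strings, not regular expressions
# -x: a pattern must match an entire line of A
# -v: print the lines of A that match none of the patterns
grep -F -x -v -f "$workdir/B" "$workdir/A" > "$workdir/C"
cat "$workdir/C"
```

With this data, `C` should contain `abc` and `abcd`. Without `-F -x`, the pattern `ab` alone would match every line of A and `C` would come out empty.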

Upvotes: 5
