neversaint

Reputation: 64004

Finding Set Complement in Unix

Given these two files:

 $ cat A.txt     $ cat B.txt
    3           11
    5           1
    1           12
    2           3
    4           2

I want to find the numbers that are in A but not in B. What's the Unix command for that?

I tried this, but it seems to fail:

comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g' 

Upvotes: 3

Views: 2722

Answers (5)

tommy.carstensen

Reputation: 9622

Here is another way to do it with join:

join -v1 <(sort A.txt) <(sort B.txt)

From the documentation on join:

‘-v file-number’ Print a line for each unpairable line in file file-number (either ‘1’ or ‘2’), instead of the normal output.
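
One caveat worth knowing: by default join pairs lines on their first whitespace-separated field rather than on the whole line, so it only agrees with comm for single-column files like the ones in the question. A minimal sketch of the difference, using made-up two-column data (the file contents and the temporary directory are illustrative assumptions, not taken from the question):

```shell
# Illustrative data only; both files are already in sorted order.
dir=$(mktemp -d)
printf '1 apple\n2 pear\n' > "$dir/A.txt"
printf '1 banana\n3 plum\n' > "$dir/B.txt"

# join pairs on field 1, so "1 apple" counts as matched by "1 banana".
join_out=$(LC_ALL=C join -v1 "$dir/A.txt" "$dir/B.txt")

# comm compares whole lines, so "1 apple" is unique to A.
comm_out=$(LC_ALL=C comm -23 "$dir/A.txt" "$dir/B.txt")

printf 'join -v1:\n%s\n' "$join_out"
printf 'comm -23:\n%s\n' "$comm_out"
rm -rf "$dir"
```

Here join reports only "2 pear", while comm reports both "1 apple" and "2 pear".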

Upvotes: 1

Robert Massaioli

Reputation: 13477

I recently wrote a program called Setdown that performs set operations from the command line.

It can perform set operations by writing a definition similar to what you would write in a Makefile:

someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection

I personally don't recommend using ad-hoc commands that were not built for the job to perform set operations. They won't work well when you need to do many set operations, or when some of those operations depend on each other. Setdown lets you write set operations that depend on other set operations.

At any rate, I think that it's pretty cool and you should totally check it out.

Note: I think that Setdown is much better than comm simply because Setdown does not require you to sort your inputs correctly. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit, because I have lost count of the times I forgot to sort the files that I passed to comm.

Upvotes: 1

sporobolus

Reputation: 21

note that the awk solution works, but it retains duplicate lines from A (those that aren't in B); the Python solution de-dupes the result

also note that comm doesn't compute a true set difference; if a line is repeated in A, and repeated fewer times in B, comm will leave the "extra" line(s) in the result:

$ cat A.txt 
120
121
122
122
$ cat B.txt 
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122

if this behavior is undesired, use sort -u to remove duplicates (only the dupes in A matter):

$ comm -23 <(sort -u A.txt) <(sort B.txt)
120

Upvotes: 2

ghostdog74

Reputation: 342363

You can try this:

$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4

Upvotes: 3

Alok Singhal

Reputation: 96131

comm -2 -3 <(sort A.txt) <(sort B.txt)

should do what you want, if I understood you correctly.

Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:

$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4
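
Because sort and comm must agree on the collation order, one way to rule out locale surprises is to force the C locale for both and sort numerically only at the end. A rough sketch using the files from the question; `set_difference` is a hypothetical helper name, and it assumes `mktemp` is available. It uses a temporary file instead of process substitution so that it also runs under plain sh.

```shell
# set_difference is a hypothetical helper: prints lines in $1 that are not in $2.
set_difference() {
    tmp=$(mktemp) || return 1
    LC_ALL=C sort -u "$2" > "$tmp"
    # "-" makes comm read the sorted copy of A from standard input.
    LC_ALL=C sort -u "$1" | LC_ALL=C comm -23 - "$tmp"
    rm -f "$tmp"
}

# The two files from the question.
dir=$(mktemp -d)
printf '3\n5\n1\n2\n4\n' > "$dir/A.txt"
printf '11\n1\n12\n3\n2\n' > "$dir/B.txt"

# Sort numerically only after comm has done the comparison.
result=$(set_difference "$dir/A.txt" "$dir/B.txt" | sort -n)
printf '%s\n' "$result"
rm -rf "$dir"
```

For the question's data this prints 4 and 5.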

Upvotes: 10
