Phil

Reputation: 113

How do I get UNIX diff to ignore duplicate lines in different positions?

I have two CSV files of about 134 MB.

All I want to do is get the 'diff' of the two files, except the position of a line doesn't matter.

In other words, let's say I have:

abc,123
def,456

and

def,456
ghi,789

I don't want to be told about def,456. It's in a different position in the second file, but I want it to be counted as not being different.

Just doing diff file1 file2 > outputfile isn't working. What command should I use to do this? I know this is trivial in PHP but I run out of memory quickly. I'd rather just use UNIX command line tools. Diff may not even be the right utility for this.

Upvotes: 3

Views: 2424

Answers (2)

Fredrik Pihl

Reputation: 45670

I would propose that you do a sort on the two input files and then compare the two sorted versions, something like this:

sort file1 > sorted_1
sort file2 > sorted_2

diff sorted_1 sorted_2
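If only the lines unique to each file are of interest, comm on the sorted files may be a simpler fit than diff. The following is a minimal sketch using the sample data from the question; the file names (sample_1.csv, only_in_1.txt and so on) are placeholders, and LC_ALL=C is an assumption made here so that sort and comm agree on the ordering.

```shell
# Sample data from the question (placeholder file names).
printf 'abc,123\ndef,456\n' > sample_1.csv
printf 'def,456\nghi,789\n' > sample_2.csv

# Assumption: byte-wise collation, so sort and comm use the same order.
export LC_ALL=C

sort sample_1.csv > sorted_1
sort sample_2.csv > sorted_2

# comm -23 keeps lines found only in the first file,
# comm -13 keeps lines found only in the second file.
comm -23 sorted_1 sorted_2 > only_in_1.txt
comm -13 sorted_1 sorted_2 > only_in_2.txt

cat only_in_1.txt   # abc,123
cat only_in_2.txt   # ghi,789
```

Note that comm counts repeated lines: if a line appears twice in one file and once in the other, one copy is reported as unique. Using sort -u instead of sort collapses repeats first, if that is the behaviour wanted.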

Upvotes: 2

user2100815

Reputation:

Sorry, what diff does is identify differences like that. I think what you want is a tool that identifies:

1
2
3

and:

3
1
2

as being the same. There is no tool I know of that does this (but I might add it to my http://code.google.com/p/csvfix/ tool at some point).

What you currently need to do is sort both files and then diff them.
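As a rough illustration with the 1, 2, 3 example above, sorting both inputs first makes diff treat them as identical. The file names here are made up for the sketch.

```shell
# Same lines in a different order (hypothetical file names).
printf '1\n2\n3\n' > order_a.txt
printf '3\n1\n2\n' > order_b.txt

sort order_a.txt > order_a.sorted
sort order_b.txt > order_b.sorted

# diff exits with status 0 when the inputs are identical, 1 when they differ.
if diff order_a.sorted order_b.sorted > /dev/null; then
    result="same"
else
    result="different"
fi
echo "$result"   # same
```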

Upvotes: 0
