Reputation: 2378

Sort two files in Linux and find lines unique to each file

I have 2 files.

File1 content looks like:

000000513609200,238/PLMN/000100
000000513609200,238/PLMN/000200
000050354428060,238/PLMN/000200
000050354428060,238/PLMN/000100
001212131415120,238/PLMN/000100
...
...

File2 contents:

000000513609200,238/PLMN/000100
000000513609200,238/PLMN/000200
000050354428060,238/PLMN/000200
000050354428060,238/PLMN/000100
001212131415120,238/PLMN/000100
...
...

File1 has close to 15000 records and file2 has close to 20000 records. I want to find the lines (records) present in only one of the two files. I'm using the command below:

comm -3 <(sort file1) <(sort file2) > file6

Is this a good option?

Also, how exactly does sort work with these records? How does it decide which column to use as the primary key?

Also, can you suggest a simple awk script that compares file1 and file2 and writes the lines present only in file1 or only in file2 to file7, so that I can compare the outputs? I want to make sure that my comm command yields the same result.

Upvotes: 1

Answers (4)

kometen

Reputation: 7862

This sorts both files together with the -u (unique) flag, which removes duplicate lines so that each distinct line appears only once in the output.

sort -u file1 file2 > file6
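Note that sort -u on both files produces the deduplicated union, which still includes lines common to both files. As a rough sketch of the difference, the snippet below contrasts it with a sort | uniq -u pipeline, which keeps only lines that occur exactly once. The files f1 and f2 and the output names are made up for illustration, and each file is deduplicated first so that a line repeated inside one file is not mistaken for a common line.

```shell
#!/bin/sh
# f1 and f2 are hypothetical stand-ins for file1 and file2.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' a b c > f1
printf '%s\n' b c d > f2

# Union of both files: every distinct line appears once.
LC_ALL=C sort -u f1 f2 > union.txt

# Lines that occur in exactly one of the two files:
# dedupe each file, merge, then keep lines that are not repeated.
{ LC_ALL=C sort -u f1; LC_ALL=C sort -u f2; } | LC_ALL=C sort | uniq -u > only_one.txt

cat union.txt      # a, b, c, d (one per line)
cat only_one.txt   # a, d (one per line)
```

Only the second result corresponds to what comm -3 reports.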

Upvotes: 2

anubhava

Reputation: 786289

Using awk you can do this without sorting:

awk 'FNR==NR {
   a[$0]
   next
}
{
   if ($0 in a)
      delete a[$0]
   else
      print
}
END {
   for (i in a)
      print i
}' file1 file2

Similarly using grep you can get the same using:

{ grep -vxFf file1 file2; grep -vxFf file2 file1; }
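To check that comm and the awk script agree, one possible approach is to normalize both outputs and compare them. The sketch below uses a few invented sample records in a temporary directory; file1.sorted and file2.sorted are hypothetical helper files. It assumes the data contains no tab characters (comm -3 indents lines unique to the second file with a tab, which is stripped here) and no duplicate lines within a file, since the two methods treat duplicates differently.

```shell
#!/bin/sh
# Invented sample records; replace with the real file1 and file2.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' \
  '000000513609200,238/PLMN/000100' \
  '000050354428060,238/PLMN/000200' \
  '001212131415120,238/PLMN/000100' > file1
printf '%s\n' \
  '000050354428060,238/PLMN/000200' \
  '000000513609200,238/PLMN/000100' \
  '009999999999999,238/PLMN/000300' > file2

# comm needs both inputs sorted in the same collation, so pin the locale.
LC_ALL=C sort file1 > file1.sorted
LC_ALL=C sort file2 > file2.sorted

# Strip the tab that comm -3 puts before lines unique to file2, then sort.
LC_ALL=C comm -3 file1.sorted file2.sorted | tr -d '\t' | LC_ALL=C sort > file6

# Same comparison with awk (no pre-sorting needed), sorted for comparison.
awk 'FNR==NR { a[$0]; next }
{
   if ($0 in a)
      delete a[$0]
   else
      print
}
END { for (i in a) print i }' file1 file2 | LC_ALL=C sort > file7

# cmp -s is silent and only sets the exit status.
if cmp -s file6 file7; then
   echo "outputs match"
else
   echo "outputs differ"
fi
```

Sorting both results is needed because awk's for (i in a) loop visits keys in an unspecified order.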

Upvotes: 2

karakfa

Reputation: 67567

If the files are sorted (or can be sorted on the fly), you can also try join. Since you don't have good test input, I'm showing it with a toy example:

$ seq 5 > f1
$ seq 3 9 > f2

This gives the records common to both files, same as comm -12 f1 f2:

$ join f1 f2  
3
4
5

This gives the unmatched records from both files, same as comm -3 f1 f2 | sed 's/^\t//':

$ join -v1 -v2 f1 f2
1
2
6
7
8
9

Upvotes: 0

Webert Lima

Reputation: 14035

If I understood correctly, to simply sort the lines based on any 'column', you can use:

sort file1 file2 -t '/' -k 3 > file6

where -t '/' specifies the column delimiter, and -k 3 specifies that the sort key starts at the third column as split by this delimiter.
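On the question of which column sort treats as the key: with no -k option, sort compares whole lines, so there is no separate primary key. A small sketch of the difference follows, using three invented records in a hypothetical file named sample. The form -k 3,3 limits the key to the third field only, and -k 1,1 is added as a tie-breaker.

```shell
#!/bin/sh
# Invented records in a hypothetical file named sample.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' \
  '000050354428060,238/PLMN/000100' \
  '000000513609200,238/PLMN/000200' \
  '001212131415120,238/PLMN/000100' > sample

# Default: the whole line is the key, so the leading digits decide the order.
LC_ALL=C sort sample > by_line.txt

# Key is only the third '/'-separated field; ties are broken by the first field.
LC_ALL=C sort -t '/' -k 3,3 -k 1,1 sample > by_field3.txt

cat by_line.txt     # 0000005..., 0000503..., 0012121...
cat by_field3.txt   # both 000100 records first, then the 000200 record
```

For comm, the default whole-line ordering is the one that matters, as long as both inputs are sorted under the same locale.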

As for the second question, if you just want to compare the files, you can try the diff command and see if it is helpful to you.

Upvotes: 0
