Reputation: 47
I have to compare two columns, col1 and col2, such that if A occurs with B and the same pair occurs again as B followed by A, only one of the two lines is printed, along with all of its remaining columns.
Input file:
A B 13.2 0.24 posx 209 215 posy 145 155
B A 13.2 0.24 posy 145 155 posx 209 215
A D 19.4 0.28 posx 209 215 posz 366 368
Required output:
A B 13.2 0.24 posx 209 215 posy 145 155
A D 19.4 0.28 posx 209 215 posz 366 368
The input file is very large (~10 GB).
Upvotes: 1
Views: 219
Reputation: 203393
$ awk '!a[$1,$2];{a[$2,$1]++}' file
A B 13.2 0.24 posx 209 215 posy 145 155
A D 19.4 0.28 posx 209 215 posz 366 368
Normally the array a would be named seen, but I'm partially playing golf with @jaypal's answer, so I need to keep my strokes down :-).
The important difference between the two answers lies in how they treat a later line that starts with the same two key values as an earlier line. jaypal's answer excludes lines whose $1 and $2 have already been seen in either order, so it also removes duplicates, whereas mine adheres strictly to the posted question and only removes later lines whose keys were previously seen in reverse order (i.e. current $1 $2 = previous $2 $1).
To also exclude duplicates, an alternative would be:
$ awk '!a[$1,$2]++;{a[$2,$1]++}' file
Chances are there are never any duplicates in the input anyway, so it probably doesn't matter either way.
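For readers who want to see the difference between the two variants on input that does contain duplicates, here is a minimal sketch. The function names drop_reversed and drop_reversed_and_dups are hypothetical, and the array is spelled out as seen instead of a; the awk programs are otherwise the same as the one-liners above.

```shell
# Hypothetical wrapper names; the awk programs match the one-liners above,
# with the array renamed from a to seen for readability.

# Variant 1: drop a line only if its (col1, col2) was seen earlier in reverse order.
drop_reversed() {
    awk '!seen[$1,$2]; { seen[$2,$1]++ }' "$@"
}

# Variant 2: additionally drop exact repeats of an earlier (col1, col2).
drop_reversed_and_dups() {
    awk '!seen[$1,$2]++; { seen[$2,$1]++ }' "$@"
}
```

Both functions read from the named files, or from standard input when called without arguments, e.g. drop_reversed file. On input containing a repeated "A D" line, the first variant keeps both copies and the second keeps only the first.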
Upvotes: 1
Reputation: 77095
Here is one way using awk:
awk '!(a[$1,$2]++ || a[$2,$1]++)' file
A B 13.2 0.24 posx 209 215 posy 145 155
A D 19.4 0.28 posx 209 215 posz 366 368
We keep track of columns 1 and 2 by using them as keys into our array a. The ++ operator increments the value stored under a key each time that key is encountered. || is a short-circuit operator: the second condition is evaluated only if the first one is false. We negate the result of the combined condition with !. Since awk's default action is to print the line when the condition is true, we can omit an explicit print statement.
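The same logic can be written out step by step, which may make the short-circuit behaviour easier to follow. This is only an expanded sketch of the one-liner above; the function name dedup_unordered and the array name seen are illustrative choices, not part of the original command.

```shell
# Expanded form of: awk '!(a[$1,$2]++ || a[$2,$1]++)' file
# dedup_unordered is a hypothetical name used for illustration.
dedup_unordered() {
    awk '{
        # Post-increment returns the old count; nonzero means the key was seen before.
        if (seen[$1,$2]++) next   # same order seen earlier: skip the line
        # Reached only when the first test was false (short-circuit behaviour).
        if (seen[$2,$1]++) next   # reverse order seen earlier: skip the line
        print                     # first sighting of this pair in either order
    }' "$@"
}
```

Called as dedup_unordered file, it should print the same lines as the one-liner.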
Upvotes: 5
Reputation: 50637
This takes the first two values from each line and forms a sorted key, which is used to filter out the duplicates:
perl -ane '@k = sort @F[0,1]; $s{"@k"}++ or print' file
Output:
A B 13.2 0.24 posx 209 215 posy 145 155
A D 19.4 0.28 posx 209 215 posz 366 368
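For comparison with the awk answers, the sorted-key idea can also be sketched in awk. This is an assumption-laden rendering, not a translation of the Perl one-liner: the function name dedup_sorted_key is hypothetical, and the two fields are ordered by plain string comparison. It stores one array entry per unordered pair instead of two.

```shell
# Hypothetical awk rendering of the sorted-key approach.
dedup_sorted_key() {
    awk '{
        # Appending "" forces a string comparison even for numeric-looking fields.
        if (($1 "") < ($2 ""))
            key = $1 SUBSEP $2
        else
            key = $2 SUBSEP $1
        # Print only the first line seen for each unordered pair.
        if (!seen[key]++) print
    }' "$@"
}
```

Usage would be dedup_sorted_key file, with the output redirected as needed.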
Upvotes: 4