greeknlatin

Reputation: 47

Comparing multiple columns within same file

I have to compare two columns, col1 and col2, such that if A occurs with B and the same pair occurs again as B followed by A, only one of the two lines is printed, together with all the following columns.

Input file:
A   B   13.2    0.24    posx    209 215 posy    145 155
B   A   13.2    0.24    posy    145 155 posx    209 215
A   D   19.4    0.28    posx    209 215 posz    366 368


Required output:
A   B   13.2    0.24    posx    209 215 posy    145 155
A   D   19.4    0.28    posx    209 215 posz    366 368

The input file is very large (~10 GB).

Upvotes: 1

Views: 219

Answers (3)

Ed Morton

Reputation: 203393

$ awk '!a[$1,$2];{a[$2,$1]++}' file      
A   B   13.2    0.24    posx    209 215 posy    145 155
A   D   19.4    0.28    posx    209 215 posz    366 368

Normally a would be named seen but I'm partially playing golf with @jaypal's answer so need to keep my strokes down :-).

The important difference between the 2 answers lies in how they'd treat a second line that starts with the same 2 key values as a previous line. jaypal's answer excludes lines that match a previously seen $1 and $2 in any order, so it would remove duplicates. Mine strictly adheres to the posted question and only removes subsequent lines whose keys have previously been seen in the reverse order (i.e. current $1 $2 = previous $2 $1).

To enhance the above so that it also excludes duplicates (as an alternative):

$ awk '!a[$1,$2]++;{a[$2,$1]++}' file

Chances are there are never duplicates in the input anyway, so it probably doesn't matter either way.
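
To make the difference concrete, here is a small sketch that runs both commands on made-up input containing an exact duplicate line. The sample data and the function names sample, strict and dedup are illustrative only and are not part of the question:

```shell
# Hypothetical sample: the A B line appears twice, then once reversed as B A.
sample() { printf 'A B 1\nA B 1\nB A 1\nA D 2\n'; }

# Strict version: drops a line only if its keys were seen earlier in reverse order.
strict() { awk '!a[$1,$2];{a[$2,$1]++}' "$@"; }

# Enhanced version: also drops repeats of the same key pair in the same order.
dedup() { awk '!a[$1,$2]++;{a[$2,$1]++}' "$@"; }

sample | strict   # keeps both A B lines, drops B A
sample | dedup    # keeps one A B line, drops the repeat and B A
```

Note that with the strict version a repeated A B line is kept only while no B A line has come in between: once B A has been read, it marks the A,B key as seen and later A B lines are dropped too.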

Upvotes: 1

jaypal singh

Reputation: 77095

Here is one way using awk:

awk '!(a[$1,$2]++ || a[$2,$1]++)' file
A   B   13.2    0.24    posx    209 215 posy    145 155
A   D   19.4    0.28    posx    209 215 posz    366 368

We keep track of columns 1 and 2 by using them together as a key into our array a. ++ increments the value stored under a key each time that key is encountered. || is a short-circuit operator: the second condition is evaluated only if the first condition is false.

We negate the result of the condition with !. Since awk's default action is to print the line when the condition is true, we can rely on that and avoid an explicit print statement.
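
For readers who find the one-liner dense, the same logic can be spelled out step by step. This is only a sketch of an equivalent form; the function name dedup_pairs and the array name seen are my own choices, not part of the original command:

```shell
# Expanded equivalent of: awk '!(a[$1,$2]++ || a[$2,$1]++)'
dedup_pairs() {
  awk '{
    # Count this key; a non-zero old value means an earlier line
    # (in either column order) already registered it, so skip the line.
    if (seen[$1, $2]++) next
    # Register the reversed key as well, so that a later line with the
    # columns swapped is caught by the first test.
    if (seen[$2, $1]++) next
    print
  }' "$@"
}

printf 'A B 1\nB A 1\nA D 2\n' | dedup_pairs
```

One edge case to be aware of: if both key columns hold the same value, the two keys are identical, the second test sees the count just written by the first, and the line is skipped even on its first appearance.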

Upvotes: 5

mpapec

Reputation: 50637

This takes the first two values from each line and forms a sorted key, which is used to filter out the duplicates:

perl -ane '@k = sort @F[0,1]; $s{"@k"}++ or print' file

Output:

A   B   13.2    0.24    posx    209 215 posy    145 155
A   D   19.4    0.28    posx    209 215 posz    366 368
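
The same sorted-key idea can be sketched in awk, which may be worth considering for a ~10 GB input because it stores one array entry per distinct pair instead of one per column order. This is an untuned sketch, not a benchmarked solution; the function name dedup_sorted_key and the variable names key and seen are assumptions of mine:

```shell
# Build one order-independent key per line and print only its first occurrence.
# Appending "" forces a string comparison even when the fields look numeric.
dedup_sorted_key() {
  awk '{
    key = (($1 "") < ($2 "")) ? $1 SUBSEP $2 : $2 SUBSEP $1
  }
  !seen[key]++' "$@"
}

printf 'A B 1\nB A 1\nA D 2\n' | dedup_sorted_key
```

Like the Perl version, this treats exact repeats and reversed pairs the same way, and a line whose two key columns are equal is printed once.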

Upvotes: 4
