abderrahim_05

Reputation: 465

awk remove duplicates based on two columns and custom duplication rule

I want to process a CSV input file like the following:

a;b
b;c
b;a
c;d
x;y
d;c

and remove both lines of each duplicate pair, according to the following rule: a;b and b;a are considered duplicates, and therefore both should be removed; the same rule applies to c;d and d;c. For the sample above, the expected output would therefore be b;c and x;y (in any order).

I tried to process the file twice and use the condition NR==FNR to tell which pass it is (first or second), but I can't figure out how to implement the test for the duplication rule defined above.

Please help me.

Upvotes: 0

Views: 81

Answers (2)

karakfa

Reputation: 67507

$ awk -F';' '{ks[$0]; a[$2 FS $1]++} END{for(k in ks) if(!a[k]) print k}' file

x;y
b;c
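
For readability, here is the same single-pass idea written out with comments (an equivalent sketch of the one-liner above, not a change to it):

awk -F';' '
{
    ks[$0]                        # remember every input line as-is
    a[$2 FS $1]++                 # count the reversed pair, e.g. "b;a" for the line "a;b"
}
END {
    for (k in ks)                 # a line survives only if its reversed form never appeared
        if (!a[k]) print k
}' file

Note that the output order of a for (k in ks) loop is not guaranteed in awk, which is why the two lines may appear in either order.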

Upvotes: 1

tshiono

Reputation: 22022

Would you please try the following:

awk -F';' '
NR==FNR {                                       # 1st pass
    if (seen[$1$2]++ || seen[$2$1]++) {         # if "ab" or "ba" already exists
        dupe[$1";"$2]++; dupe[$2";"$1]++        # then mark "a;b" and "b;a" as duplicates
    }
    next
}
! dupe[$0]                                      # print unless marked as a duplicate
' file file

Output:

b;c
x;y
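
One small caveat: seen[$1$2] joins the two fields without a separator, so in principle distinct pairs could collide (for example the fields "ab","c" and "a","bc" both produce the key "abc"). If that matters for your data, a variant of the same two-pass approach that keys on $1 FS $2 avoids it; this is only a sketch along the lines of the answer above:

awk -F';' '
NR==FNR {                                        # 1st pass
    if (seen[$1 FS $2]++ || seen[$2 FS $1]++) {  # key includes the separator, so fields cannot merge
        dupe[$1 FS $2]; dupe[$2 FS $1]           # mark both orderings as duplicates
    }
    next
}
!($0 in dupe)                                    # 2nd pass: print unless marked as a duplicate
' file file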

Upvotes: 1
