Gandalf the Grey

Reputation: 76

AWK Identify duplicates from two columns but print the first instance

Newbie here. I need to remove rows that are duplicated in either of two columns (i.e. if row 1 and row 2 have the same value in column 1, delete one of the rows and keep the other, and do the same for column 2). The files are tab-delimited.

Here is example data

580615  580795  Del
580769  580795  Del
656123  657154  Del
656123  657195  Del

Expected output:

580769  580795  Del
656123  657154  Del

I am using Bash, and this is an intermediate step in a pipeline I am developing.

I have tried to use this

awk 'seen[$1, $2]++ == 1' file 

and

awk 'n=x[$1,$2]{print n"\n"$0;} {x[$1,$2]=$0;}' file

but I don't get any output.

Any suggestions will be appreciated. Thanks!

Upvotes: 2

Views: 1726

Answers (3)

zombic

Reputation: 221

$ cat file

580615  580795  Del
580769  580795  Del
656123  657154  Del
656123  657195  Del
  1. using sort:

    $ sort -uk1,1 file | sort -uk2,2

-uk1,1 sorts on the 1st column and deletes the duplicates, then

-uk2,2 sorts on the 2nd column and removes the duplicates

  2. using sort and uniq:

    $ sort -uk1,1 file | uniq -f1
    

Output:

580615  580795  Del
656123  657154  Del

If you add -r to the first sort

$ sort -uk1,1r file | sort -uk2,2

then the output is

580769  580795  Del
656123  657154  Del
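The pipelines above can be reproduced end to end on the sample data. One caveat (an assumption about GNU sort specifically): with -u the last-resort comparison is disabled and the sort is stable, so for each run of equal keys the line that came first in the input is the one kept.

```shell
# Sample data from the question (tab-separated).
printf '580615\t580795\tDel\n580769\t580795\tDel\n656123\t657154\tDel\n656123\t657195\tDel\n' > file

# Dedupe on column 1, then on column 2; the ascending sort keeps
# the 580615 row for the duplicated column-2 value 580795.
sort -uk1,1 file | sort -uk2,2

# Reversing the first sort with -r keeps the 580769 row instead,
# which matches the asker's expected output.
sort -uk1,1r file | sort -uk2,2
```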

Upvotes: 0

anubhava

Reputation: 784958

You can use awk like this:

awk '!a[$1]++ && !b[$2]++' file

580615  580795  Del
656123  657154  Del

This keeps two associative arrays, a and b, holding the values already seen in column 1 and column 2; a line is printed only when both of its values are new.
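A quick check of this one-liner on the question's data (a minimal sketch; the file name is the one used above). One detail worth knowing: && short-circuits, so when a line is rejected by the column-1 test, b[$2] is never incremented for that line.

```shell
# Sample data from the question (tab-separated).
printf '580615\t580795\tDel\n580769\t580795\tDel\n656123\t657154\tDel\n656123\t657195\tDel\n' > file

# Print a line only when both its column-1 and column-2 values are new.
# Because && short-circuits, b[$2] is not updated on lines already
# rejected by the column-1 test.
awk '!a[$1]++ && !b[$2]++' file
# → 580615  580795  Del
# → 656123  657154  Del
```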

Upvotes: 1

cglacet

Reputation: 10912

If I understand correctly you can do:

awk '{ f[$1]+=1; s[$2]+=1; if(f[$1]==1 && s[$2]==1) print $0;}' file

For every line this counts the occurrences of the column-1 and column-2 values; if both values are seen for the first time, the line is printed.
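Note that, unlike the short-circuiting one-liner in the previous answer, this version always updates both counters. On the question's data the two agree, but on a hypothetical input (file2 below is made up for illustration) where a row is suppressed by one column while introducing a new value in the other, they differ:

```shell
# Hypothetical input: row 2 repeats column 1 but introduces B in column 2.
printf '1\tA\n1\tB\n2\tB\n' > file2

# This answer counts B on the suppressed row 2, so "2 B" is dropped:
awk '{ f[$1]+=1; s[$2]+=1; if(f[$1]==1 && s[$2]==1) print $0;}' file2
# → 1  A

# The short-circuit version never counted row 2's B, so "2 B" survives:
awk '!a[$1]++ && !b[$2]++' file2
# → 1  A
# → 2  B
```

Which behaviour is correct depends on whether a suppressed row should still "use up" its values.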

Upvotes: 0
