Reputation: 76
Newbie here. I need to remove rows that are duplicated in either of two columns (i.e. if row 1 and row 2 have the same value in column 1, delete one of the rows and keep the other, then do the same for column 2). The files are tab-delimited.
Here is some example data:
580615 580795 Del
580769 580795 Del
656123 657154 Del
656123 657195 Del
Expected output:
580769 580795 Del
656123 657154 Del
I am using Bash, and this is an intermediate step in a pipeline I am developing.
I have tried this:
awk 'seen[$1, $2]++ == 1' file
and
awk 'n=x[$1,$2]{print n"\n"$0;} {x[$1,$2]=$0;}' file
but I don't get any output.
Any suggestions would be appreciated. Thanks!
Upvotes: 2
Views: 1726
Reputation: 221
$ cat file
580615 580795 Del
580769 580795 Del
656123 657154 Del
656123 657195 Del
Using sort:
$ sort -uk1,1 file | sort -uk2,2
-k1,1 sorts on the first field, and -u keeps only the first line of each run of duplicate first fields; -k2,2 then sorts on the second field and removes lines with duplicate second fields the same way.
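As an illustration (assuming GNU sort, where -u keeps the first line of each run of equal keys), the intermediate result of the first sort is:
$ sort -uk1,1 file
580615 580795 Del
580769 580795 Del
656123 657154 Del
The second sort then collapses the two lines sharing 580795 in field 2.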
Using sort and uniq:
$ sort -uk1,1 file | uniq -f1
Output:
580615 580795 Del
656123 657154 Del
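Here uniq -f1 skips the first field when comparing, so adjacent lines that agree from field 2 onward collapse to one; since uniq only removes adjacent duplicates, the preceding sort is what makes this work. A minimal illustration with made-up data:
$ printf '1 X\n2 X\n3 Y\n' | uniq -f1
1 X
3 Y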
If we add -r to the first sort:
$ sort -uk1,1r file | sort -uk2,2
the reversed first sort puts 580769 580795 ahead of 580615 580795, so the second pass keeps that line instead, and the output matches the expected output:
580769 580795 Del
656123 657154 Del
Upvotes: 0
Reputation: 784958
You can use awk like this:
awk '!a[$1]++ && !b[$2]++' file
580615 580795 Del
656123 657154 Del
This keeps two associative arrays, a and b, holding the values already seen in column 1 and column 2; a line is printed only when both of its values are new.
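One caveat worth noting: && short-circuits, so when column 1 is a repeat, b[$2]++ is never evaluated and that line's column 2 value goes unrecorded. It makes no difference on this sample data, but a sketch of a variant that unconditionally counts both columns would be:
$ awk '{new1 = !a[$1]++; new2 = !b[$2]++} new1 && new2' file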
Upvotes: 1
Reputation: 10912
If I understand correctly, you can do:
awk '{ f[$1]+=1; s[$2]+=1; if(f[$1]==1 && s[$2]==1) print $0;}' file
For every line, this counts the occurrences of the value in each of the first two columns; if both values are being seen for the first time, the line is printed.
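Since the question says the files are tab-delimited, you could also set the field separator explicitly; awk's default whitespace splitting already handles tabs, so this only matters if a field can itself contain spaces. A sketch of the same program with an explicit tab separator:
$ awk -F'\t' '{ f[$1]++; s[$2]++; if (f[$1]==1 && s[$2]==1) print }' file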
Upvotes: 0