AISWARYA JAYAPRAKASH

Reputation: 63

Remove duplicated rows based on two columns (in both directions) and retain only one

I'd like to remove duplicated rows from a tab delimited file that look like this:

 arahy.Tifrunner.gnm1.ann1.A4HWYP   arahy.Tifrunner.gnm1.ann1.BWH72M
 arahy.Tifrunner.gnm1.ann1.A4HWYP   arahy.Tifrunner.gnm1.ann1.PK5V4W
 arahy.Tifrunner.gnm1.ann1.BWH72M   arahy.Tifrunner.gnm1.ann1.A4HWYP
 arahy.Tifrunner.gnm1.ann1.D7QF3J   arahy.Tifrunner.gnm1.ann1.A6ZB5M
 arahy.Tifrunner.gnm1.ann1.A6ZB5M   arahy.Tifrunner.gnm1.ann1.D7QF3J

based on columns 1 and 2, so that the output file keeps only one direction of each pair, like this:

 arahy.Tifrunner.gnm1.ann1.A4HWYP   arahy.Tifrunner.gnm1.ann1.BWH72M
 arahy.Tifrunner.gnm1.ann1.A4HWYP   arahy.Tifrunner.gnm1.ann1.PK5V4W
 arahy.Tifrunner.gnm1.ann1.D7QF3J   arahy.Tifrunner.gnm1.ann1.A6ZB5M

I am trying the code

awk -F'\t' '!x[$2];{x[$1]++}' input.txt > out.txt

But it's not giving the desired output; instead it's simply removing the last line of the original file. Is there any way of doing this using awk and/or sort?

Upvotes: 3

Views: 65

Answers (3)

Sundeep

Reputation: 23667

$ awk 'BEGIN{FS=OFS="\t"}
       !(($1 FS $2 in x) || ($2 FS $1 in x));
       {x[$1 FS $2]}' ip.txt
arahy.Tifrunner.gnm1.ann1.A4HWYP    arahy.Tifrunner.gnm1.ann1.BWH72M
arahy.Tifrunner.gnm1.ann1.A4HWYP    arahy.Tifrunner.gnm1.ann1.PK5V4W
arahy.Tifrunner.gnm1.ann1.D7QF3J    arahy.Tifrunner.gnm1.ann1.A6ZB5M
  • BEGIN{FS=OFS="\t"} sets the input and output field separators to the tab character
  • {x[$1 FS $2]} uses the first and second fields, joined by a tab, as a key
  • $1 FS $2 in x and $2 FS $1 in x check whether the two fields already exist as a key, in either order

You can also simplify this to:

awk 'BEGIN{FS=OFS="\t"} !(($0 in x) || ($2 FS $1 in x)); {x[$0]}'

or (just realized that OFS isn't needed)

awk -F'\t' '!(($0 in x) || ($2 FS $1 in x)); {x[$0]}'
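As a quick sanity check, here's the simplified one-liner run on a toy version of the data (short made-up IDs standing in for the real ones):

```shell
# Pairs (a,b) and (d,e) each appear in both orders; only the first
# direction of each should survive.
printf 'a\tb\na\tc\nb\ta\nd\te\ne\td\n' |
  awk -F'\t' '!(($0 in x) || ($2 FS $1 in x)); {x[$0]}'
# prints three tab-separated lines: a b, a c, d e
```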

Upvotes: 4

James Brown

Reputation: 37394

Another awk:

$ awk '!a[($1<$2?$1:$2),($1<$2?$2:$1)]++' file

Output:

arahy.Tifrunner.gnm1.ann1.A4HWYP        arahy.Tifrunner.gnm1.ann1.BWH72M
arahy.Tifrunner.gnm1.ann1.A4HWYP        arahy.Tifrunner.gnm1.ann1.PK5V4W
arahy.Tifrunner.gnm1.ann1.D7QF3J        arahy.Tifrunner.gnm1.ann1.A6ZB5M

In case the columns have spaces in them, you should use awk -F'\t'.
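For example (made-up two-letter IDs), combining the ternary-key trick with an explicit tab separator:

```shell
# The key always stores the two fields in sorted order, so a swapped
# duplicate maps to the same array entry and is suppressed.
printf 'a\tb\nb\ta\nc\td\n' |
  awk -F'\t' '!a[($1<$2?$1:$2),($1<$2?$2:$1)]++'
# keeps only the lines: a b  and  c d
```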

Upvotes: 2

Guss

Reputation: 32315

I'm not an awk expert, so here's a solution in Bash, if you are interested:

declare -A db
while IFS= read -r line; do
  index="$(sed 's,[[:space:]]\+,\n,g' <<<"$line" | sort)"
  [ -z "${db[$index]}" ] && echo "$line"
  db[$index]=1
done < input.txt > out.txt

The trick here is to create an index that doesn't care about the order of the values in the record. I do this by converting each record to a set of lines with sed and then running it through sort. Ideally sort would let us sort "words" within a single line, but AFAIK it doesn't.
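If the file is known to have exactly two tab-separated columns, the same order-independent index can be built without sed or sort by comparing the two fields directly (a sketch under that assumption):

```shell
# Pure-Bash variant: read the two columns separately and put the
# lexically smaller one first in the index. Assumes exactly two
# tab-separated fields per line and no commas inside the fields.
declare -A db
while IFS=$'\t' read -r c1 c2; do
  if [[ $c1 < $c2 ]]; then index="$c1,$c2"; else index="$c2,$c1"; fi
  [ -z "${db[$index]}" ] && printf '%s\t%s\n' "$c1" "$c2"
  db[$index]=1
done < input.txt > out.txt
```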

Upvotes: 0
