Reputation: 63
I'd like to remove duplicated rows from a tab-delimited file that looks like this:
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.BWH72M
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.PK5V4W
arahy.Tifrunner.gnm1.ann1.BWH72M arahy.Tifrunner.gnm1.ann1.A4HWYP
arahy.Tifrunner.gnm1.ann1.D7QF3J arahy.Tifrunner.gnm1.ann1.A6ZB5M
arahy.Tifrunner.gnm1.ann1.A6ZB5M arahy.Tifrunner.gnm1.ann1.D7QF3J
based on columns 1 and 2, to get an output file that keeps only one direction of each pair, like this:
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.BWH72M
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.PK5V4W
arahy.Tifrunner.gnm1.ann1.D7QF3J arahy.Tifrunner.gnm1.ann1.A6ZB5M
I am trying this code:
awk -F'\t' '!x[$2];{x[$1]++}' input.txt > out.txt
But it's not giving the desired output; instead it simply removes the last line of the original file. Is there any way of doing this using awk and/or sort?
Upvotes: 3
Views: 65
Reputation: 23667
$ awk 'BEGIN{FS=OFS="\t"}
!(($1 FS $2 in x) || ($2 FS $1 in x));
{x[$1 FS $2]}' ip.txt
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.BWH72M
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.PK5V4W
arahy.Tifrunner.gnm1.ann1.D7QF3J arahy.Tifrunner.gnm1.ann1.A6ZB5M
BEGIN{FS=OFS="\t"} sets the input and output field separator to the tab character
{x[$1 FS $2]} uses the first and second fields, separated by a tab, as a key
$1 FS $2 in x and $2 FS $1 in x check whether the two fields exist as a key in either order
You can also simplify this to:
awk 'BEGIN{FS=OFS="\t"} !(($0 in x) || ($2 FS $1 in x)); {x[$0]}'
or (just realized that OFS isn't needed)
awk -F'\t' '!(($0 in x) || ($2 FS $1 in x)); {x[$0]}'
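A variant of the same idea, as a sketch (assuming each line holds exactly these two tab-separated fields): record both orientations of the pair up front, so the print condition needs only a single lookup:
awk -F'\t' '!($0 in x); {x[$0]; x[$2 FS $1]}' ip.txt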
Upvotes: 4
Reputation: 37394
Another awk:
$ awk '!a[($1<$2?$1:$2),($1<$2?$2:$1)]++' file
Output:
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.BWH72M
arahy.Tifrunner.gnm1.ann1.A4HWYP arahy.Tifrunner.gnm1.ann1.PK5V4W
arahy.Tifrunner.gnm1.ann1.D7QF3J arahy.Tifrunner.gnm1.ann1.A6ZB5M
In case the columns have spaces in them, you should use awk -F"\t"
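The comma in the array subscript joins the two expressions with SUBSEP, and the ternaries put the lexicographically smaller field first, so both directions of a pair map to the same key. A sketch of the same idea with the canonical key spelled out in a variable:
awk -F'\t' '{key = ($1 < $2) ? $1 FS $2 : $2 FS $1} !seen[key]++' file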
Upvotes: 2
Reputation: 32315
I'm not an awk expert, so here's a solution in Bash, if you are interested:
declare -A db
while IFS= read -r line; do
    # Normalize the record into an order-insensitive index:
    # one word per line, sorted.
    index="$(sed 's,[[:space:]]\+,\n,g' <<<"$line" | sort)"
    [ -z "${db[$index]}" ] && echo "$line"
    db[$index]=1
done < input.txt > out.txt
The trick here is to create an index that doesn't care about the order of values in the record, which I do by converting each record to a set of lines using sed, then running it through sort. Ideally sort would allow us to sort "words" on a single line, but AFAIK it doesn't.
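If forking sed and sort for every record matters, here's a sketch of the same order-insensitive index built with only bash's string comparison (this assumes exactly two tab-separated columns, and that the fields never contain the "|" used to join the key):
declare -A db
while IFS=$'\t' read -r a b; do
    # Put the lexicographically smaller field first, so both
    # directions of a pair map to the same key.
    if [[ $a < $b ]]; then index="$a|$b"; else index="$b|$a"; fi
    [ -z "${db[$index]}" ] && printf '%s\t%s\n' "$a" "$b"
    db[$index]=1
done < input.txt > out.txt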
Upvotes: 0