Reputation: 244
My file is in the format
>id1
sequence1
>id2
sequence2
>id1
sequence3
the output i want is:
>id1
sequence1
>id2
sequence2
i.e. I need to remove sequences and id both in pairs if id is repeat.
I tried the following code, but it doesnt work.
awk '{
if(NR%2 == 1)
{
fastaheader = $0; x[fasta_header] = x[fasta_header] + 1;
}
else
{
seq = $0; {if(x[fasta_header] <= 1) {print fasta_header;print seq;}}
}
}' filename.txt
Upvotes: 0
Views: 1200
Reputation: 4267
I prefer awk
, you don't need pipe, and it prints lines in the sequence they appear in original file.
If you don't mind the line sequence, you can use sort
xargs -n2 < file | sort -uk1,1 | xargs -n1
Upvotes: 0
Reputation: 41460
This should do:
awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2
Using the RS=">"
changes the record to include both id
and sequence.
awk '{$1=$1}1' RS=">"
id1 sequence1
id2 sequence2
id1 sequence1
Then the array removes all duplicate
The last awk '!/^>?$/'
just removes some blank spaces and an extra >
cat file2
>id1
sequence1
>id2
sequence2
>id1
sequence3
This file should be intact, since the number in sequence are all difference.
awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file2 | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2
>id1
sequence3
Upvotes: 1
Reputation: 212654
Assuming your ids and sequences are always exactly one line:
awk 'NR%2 && !a[$0]++ { print; getline l ; print l }' input
Upvotes: 1
Reputation: 755054
It looks as though the ID lines start with >
. Given the order of the output, you want the first sequence associated with a given ID, not the last. This means you need something like:
awk '/^>/ { if (id[$1]++ == 0) printing = 1; else printing = 0 }
{ if (printing) print }'
The first line decides whether the current ID is unique and sets printing
to 1 if it is, and 0 otherwise. The second line notes whether printing is required, and prints appropriately. Note that if there's more than one line of data in the sequence, it is quite happy to print all those lines. It does not rely on there being just one line in the sequence data.
Upvotes: 1