priyanka

Reputation: 244

Removing duplicates using awk in Unix

My file is in this format:

>id1
sequence1
>id2
sequence2
>id1
sequence3

The output I want is:

>id1
sequence1
>id2
sequence2

That is, if an id is repeated, I need to remove both the repeated id and its sequence as a pair.

I tried the following code, but it doesn't work.

awk '{
if(NR%2 == 1)
{
    fastaheader = $0; x[fasta_header] = x[fasta_header] + 1; 
}
else 
{
    seq = $0; {if(x[fasta_header] <= 1) {print fasta_header;print seq;}}
}
}' filename.txt

Upvotes: 0

Views: 1200

Answers (4)

ray

Reputation: 4267

I prefer awk: you don't need a pipe, and it prints the lines in the order they appear in the original file.

If you don't mind the output order changing, you can use sort:

xargs -n2 < file  | sort -uk1,1 | xargs -n1
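
To see what each stage of that pipeline does, here is a small sketch run against the sample from the question. The scratch file and the variable names are made up for the demo. Which sequence survives for a repeated id depends on the sort implementation (GNU sort documents that -u keeps the first of an equal run, POSIX does not promise it), so treat the result as illustrative.

```shell
# Recreate the sample from the question in a scratch file (path is arbitrary).
tmp=$(mktemp)
printf '%s\n' '>id1' 'sequence1' '>id2' 'sequence2' '>id1' 'sequence3' > "$tmp"

# Stage 1: join each id line with the sequence line that follows it.
paired=$(xargs -n2 < "$tmp")

# Stage 2: sort on the first field (the id) and keep one line per id.
unique=$(printf '%s\n' "$paired" | sort -u -k1,1)

# Stage 3: split every pair back into two lines.
result=$(printf '%s\n' "$unique" | xargs -n1)

printf '%s\n' "$result"
rm -f "$tmp"
```

Note that this only works when every record is exactly two lines and neither line contains whitespace, quotes or backslashes, since xargs splits on whitespace and interprets those characters.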

Upvotes: 0

Jotne

Reputation: 41460

This should do:

awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2

Setting RS=">" makes each record hold both the id and its sequence:

awk '{$1=$1}1' RS=">" file
id1 sequence1
id2 sequence2
id1 sequence1

The array then removes all duplicate records, i.e. those whose id and sequence both match.

The final awk '!/^>?$/' just removes the blank lines and a stray > line.


cat file2
>id1
sequence1
>id2
sequence2
>id1
sequence3

This file should come through intact, since the sequences are all different.

awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file2 | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2
>id1
sequence3
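
The commands above key the array on the whole record, so two records only count as duplicates when both the id and the sequence match. If the goal is one record per id, as in the question, a variation on the same RS=">" idea could key on the first field instead. This is only a sketch and not part of the answer above; the function name dedup_by_id is made up for illustration, and it assumes the sequence data never contains a > character.

```shell
# Hypothetical helper: keep the first record seen for each id.
# With RS=">" each record looks like "id\nsequence...\n". $1 is the id,
# because default field splitting treats newlines as whitespace.
# NF skips the empty record that precedes the first ">".
dedup_by_id() {
    awk -v RS='>' 'NF && !seen[$1]++ { printf ">%s", $0 }' "$@"
}

# Sample from the question, fed on stdin.
printf '%s\n' '>id1' 'sequence1' '>id2' 'sequence2' '>id1' 'sequence3' | dedup_by_id
```

Because records are printed as they are read rather than in an END loop, the output keeps the order of the input file.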

Upvotes: 1

William Pursell

Reputation: 212654

Assuming your ids and sequences are always exactly one line:

awk 'NR%2 && !a[$0]++ { print; getline l ; print l }' input

Upvotes: 1

Jonathan Leffler

Reputation: 755054

It looks as though the ID lines start with >. Given the order of the output, you want the first sequence associated with a given ID, not the last. This means you need something like:

awk '/^>/ { if (id[$1]++ == 0) printing = 1; else printing = 0 }
          { if (printing) print }'

The first line decides whether the current ID is unique and sets printing to 1 if it is, and 0 otherwise. The second line notes whether printing is required, and prints appropriately. Note that if there's more than one line of data in the sequence, it is quite happy to print all those lines. It does not rely on there being just one line in the sequence data.
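
As a quick check of the multi-line claim, the sketch below wraps the same program in a shell function and feeds it a sample in which the id1 records span two lines. The sample data and the function name first_per_id are invented for the demo.

```shell
# Hypothetical wrapper around the awk program above.
first_per_id() {
    awk '/^>/ { if (id[$1]++ == 0) printing = 1; else printing = 0 }
              { if (printing) print }' "$@"
}

# Made-up sample: id1 appears twice, each time with two sequence lines.
printf '%s\n' '>id1' 'seqA' 'seqB' '>id2' 'seqC' '>id1' 'seqD' 'seqE' | first_per_id
```

With this input the first id1 record (both of its sequence lines) and the id2 record are printed, and the second id1 record is dropped.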

Upvotes: 1
