user3660245

Reputation: 123

Eliminate equally named paragraphs

I would like to eliminate equally named paragraphs (containing different strings of data, DNA in my case).

For example my file is:

>blue
1. agccttgatcgttac
2. tttactaaagatgat
3. agccttga
>orange
1. tttactaaagatg
2. agccttgatcgtt
3. tttacta
>blue
1. caatgcatgcaga 
2. agccttgatcgtt
3. tttactaaagatg
4. caatgca

I would like to remove all but one of the equally named paragraphs (in this case, keeping only one of the ">"blue paragraphs). Each paragraph starts with ">". How can I do this?

Upvotes: 1

Views: 80

Answers (4)

Ed Morton

Reputation: 203985

$ awk '/^>/{seen=cnt[$0]++} !seen' file
>blue
1. agccttgatcgttac
2. tttactaaagatgat
3. agccttga
>orange
1. tttactaaagatg
2. agccttgatcgtt
3. tttacta

Upvotes: 1

Jonathan Leffler

Reputation: 754450

This is a simple job for awk:

awk '/^>/ { print_it = 0; if (seen[$1]++ == 0) print_it = 1 }
          { if (print_it) print }' file

This keeps the first paragraph with a given title. If you need to keep the last such paragraph, you have to work a lot harder.
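
One possible way to keep the last paragraph instead is to read the file twice: the first pass records where the last header with each title sits, and the second pass prints only the paragraph that starts there. This is a minimal sketch, not necessarily what was meant above; the function name keep_last is hypothetical, and it assumes the file can be read twice (so it will not work on a pipe).

```shell
# keep_last FILE: hypothetical helper, prints FILE keeping only the
# last paragraph for each title (paragraphs start with ">").
keep_last() {
    awk '
        # Pass 1: remember the line number of the last header per title.
        NR == FNR { if (/^>/) last[$1] = FNR; next }
        # Pass 2: at each header, print only if it is the last one with that title.
        /^>/      { print_it = (last[$1] == FNR) }
        print_it
    ' "$1" "$1"
}
```

Called as keep_last file, this leaves the surviving paragraphs in their original order; with the sample data from the question it would drop the first blue paragraph and keep the second.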

Upvotes: 0

repzero

Reputation: 8402

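Using awk (this treats the literal string ">" with its quotes as the record separator; a multi-character RS is supported by GNU awk but left unspecified by POSIX):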

awk -v RS="\">\"" '{c=0;name=name" "$1;split(name,arr);for(i in arr){if(arr[i]==$1){++c}};if(c==1){print RS $0;next}}' file > new_file

For example, given the data:

">"orange
    tttactaaagatg
    agccttgatcgtt
    tttacta
">"blue
    caatgcatgcaga
    agccttgatcgtt
    tttactaaagatg
    caatgca
">"blue
    caatgcatgcaga
    agccttgatcgtt
    tttactaaagatg
    caatgca
">"orange

    tttactaaagatg
    agccttgatcgtt
    tttacta
">"green

    tttactaaagatg
    agccttgatcgtt
    tttacta

the result is:

">"orange
    tttactaaagatg
    agccttgatcgtt
    tttacta

">"blue
    caatgcatgcaga
    agccttgatcgtt
    tttactaaagatg
    caatgca

">"green

    tttactaaagatg
    agccttgatcgtt
    tttacta

Upvotes: 0

David Jashi

Reputation: 4511

I'm sure colleagues can offer a more elegant way, but here is a quick and dirty one:

cat in.txt |grep "^>"|sort|awk ' p == $0; { p = $0 }' >headers.txt
cp in.txt out.txt
while read in; do
    cat out.txt| sed "/^$in/,/^>/{//!d}"|sed "/^$in/d" >temp.txt
    mv temp.txt out.txt
done < headers.txt

Given in.txt as the input file, you get out.txt as the output and a list of the deleted paragraph names in headers.txt.

Note that this deletes ALL occurrences of duplicate-named paragraphs rather than leaving one of them.
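
The same "delete every duplicate-named paragraph" behaviour can probably be had without temporary files by reading the input twice with awk. This is only a sketch of an alternative, not the script above; the function name drop_all_dups is hypothetical, and it compares whole header lines.

```shell
# drop_all_dups FILE: hypothetical helper, prints FILE without any
# paragraph whose header line occurs more than once.
drop_all_dups() {
    awk '
        # Pass 1: count how often each header line occurs.
        NR == FNR { if (/^>/) count[$0]++; next }
        # Pass 2: print a paragraph only if its header is unique.
        /^>/      { print_it = (count[$0] == 1) }
        print_it
    ' "$1" "$1"
}
```

Usage would be drop_all_dups in.txt > out.txt. Unlike the sed loop, this does not write headers.txt.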

Upvotes: 0
