MRM
MRM

Reputation: 23

Change sequence SeqIds in multi-line fasta file

I have a multiplex fasta file and I need to change the Ids of my sequence. So now my fasta file looks like this:

>Mafalda01_2759;barcodelabel=CAR1_01
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTTTTAAGTCTGATGTGAAATCTC
ATAGCTTAACTATGAGCGGTCATTGGAAACTGGAGAACTTGAGTATAGAAGAGAAGAGTGGAATTCCAAGTGTAGCGGTG
AAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGACTCTTTGGTCTATTACTGACGCTGAGGTACGAAAGCGTG
GGGAGCAAAC
>Mafalda02_51112;barcodelabel=CAR7_04
TACGTAGGGAGCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGCAGTAAGTTACTTGTGAAATCTC
TGGGCTTAACCCAGAACGGCCAAGTAATACTGCAGTGCTAGAGTGCAGAAAGGGCAATCGGAATTCTTGGTGTAGCGGTG
AAATGCGTAGATATCAAGAGGAACACCTGAGGCGAAGGCGGGTTGCTTGTCTGACACTGACGCTGAGGCGCGAAAGCCAG
GGGAGCAAAC
>Mafalda01_145359;barcodelabel=CAC11_86
TACGGAGGATCCAAGCGTTATCCGGAATCATTGGGTTTAAAGGGTCCGTAGGCGGACAATTAAGTCAGCGGTGAAAGTCT
GTAGCTCAACTATAGAACTGCCGTTGATACTGGTTGTCTTGAATCAATGTGAAGTGGCTAGAATATGTGGTGTAGCGGTG
AAATGCTTAGATATCACATAGAACACCGATTGCGAAGGCAGGTCACTAACATTGCATTGACGCTGATGGACGAAAGCGTG
GGGAGCGAAC
>Mafalda02_3119;barcodelabel=CAR4_03
TACGGGGGGTGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGTGGTCTTATAAGGGTGTGGTGAAAGCCC
GGGGCTCAACCCCGGGTCGGCCGTGCCGACTGTGAGACTAGAGTGCTGTAGGGGCAGGCGGAATTCCGGGTGTAGCGGTG
GAATGCGTAGAGATCCGGAGGAAGACCGGTGGCGAAGGCGGCCTGCTGGGCAGATACTGACACTGAGGCGCGACAGCGTG
GGGAGCAAAC

What I want to do is to remove everything up to the sign "=" (except the ">"), remove the "[number] after the barcode label and add "[sequencial number]" for each Id of each sequence

Like:

>CAR1_1
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTTTTAAGTCTGATGTGAAATCTC
ATAGCTTAACTATGAGCGGTCATTGGAAACTGGAGAACTTGAGTATAGAAGAGAAGAGTGGAATTCCAAGTGTAGCGGTG
AAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGACTCTTTGGTCTATTACTGACGCTGAGGTACGAAAGCGTG
GGGAGCAAAC
>CAR7_2
TACGTAGGGAGCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGCAGTAAGTTACTTGTGAAATCTC
TGGGCTTAACCCAGAACGGCCAAGTAATACTGCAGTGCTAGAGTGCAGAAAGGGCAATCGGAATTCTTGGTGTAGCGGTG
AAATGCGTAGATATCAAGAGGAACACCTGAGGCGAAGGCGGGTTGCTTGTCTGACACTGACGCTGAGGCGCGAAAGCCAG
GGGAGCAAAC

(and so on and so for...)

Would that be possible?

Upvotes: 0

Views: 933

Answers (2)

Vince
Vince

Reputation: 3395

In awk:

awk 'BEGIN { c=1 } $1 ~ /^>/ { s=gensub(/.*=([A-Z0-9]+)_[0-9]+$/,">\\1_"c,"g",$1); print s; c+=1 } $1 !~ /^>/ { print }' seqs.fa

First, inistalize counter for sequence record:

BEGIN { c = 1 }

The interesting bit is:

$1 ~ /^>/ { s=gensub(/.*=([A-Z0-9]+)_[0-9]+$/,">\\1_"c,"g",$1); print s; c+=1 }

$1 ~ /^>/ { ... } will only match on lines with leading >.

Then, gensub captures whatever is after the last = into \\1, excluding the trailing _[0-9]+, then prints ">\\1_"c, where c is value of counter initialized above. Print string s, then increment counter.

Second bit of code is:

$1 !~ /^>/ { print }

If line does not have a leading >, just print it.

Upvotes: 2

Maximilian Peters
Maximilian Peters

Reputation: 31659

A simple solution in Python would be:

sequences = """>Mafalda01_2759;barcodelabel=CAR1_01
TACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGTACGCAGGCGGTTTTTTAAGTCTGATGTGAAATCTC
ATAGCTTAACTATGAGCGGTCATTGGAAACTGGAGAACTTGAGTATAGAAGAGAAGAGTGGAATTCCAAGTGTAGCGGTG
AAATGCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGACTCTTTGGTCTATTACTGACGCTGAGGTACGAAAGCGTG
GGGAGCAAAC
>Mafalda02_51112;barcodelabel=CAR7_04
TACGTAGGGAGCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGCGCAGTAAGTTACTTGTGAAATCTC
TGGGCTTAACCCAGAACGGCCAAGTAATACTGCAGTGCTAGAGTGCAGAAAGGGCAATCGGAATTCTTGGTGTAGCGGTG
AAATGCGTAGATATCAAGAGGAACACCTGAGGCGAAGGCGGGTTGCTTGTCTGACACTGACGCTGAGGCGCGAAAGCCAG
GGGAGCAAAC
>Mafalda01_145359;barcodelabel=CAC11_86
TACGGAGGATCCAAGCGTTATCCGGAATCATTGGGTTTAAAGGGTCCGTAGGCGGACAATTAAGTCAGCGGTGAAAGTCT
GTAGCTCAACTATAGAACTGCCGTTGATACTGGTTGTCTTGAATCAATGTGAAGTGGCTAGAATATGTGGTGTAGCGGTG
AAATGCTTAGATATCACATAGAACACCGATTGCGAAGGCAGGTCACTAACATTGCATTGACGCTGATGGACGAAAGCGTG
GGGAGCGAAC
>Mafalda02_3119;barcodelabel=CAR4_03
TACGGGGGGTGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGTGGTCTTATAAGGGTGTGGTGAAAGCCC
GGGGCTCAACCCCGGGTCGGCCGTGCCGACTGTGAGACTAGAGTGCTGTAGGGGCAGGCGGAATTCCGGGTGTAGCGGTG
GAATGCGTAGAGATCCGGAGGAAGACCGGTGGCGAAGGCGGCCTGCTGGGCAGATACTGACACTGAGGCGCGACAGCGTG
GGGAGCAAAC"""

index = 0
for line in sequences.splitlines():
    if line.startswith('>'):
        index += 1
        output = line[line.find('barcodelabel=') + len('barcodelabel='):]
        output = output[:output.find('_') + 1]
        print('>' + output + str(index))
    else:
        print(line)

You could read your sequences from a file and store it in sequences or replace for line in with

fasta_file = open('your_seqs.fasta', 'r')
    for line in fasta_file':

The possibilities are endless.....

Upvotes: 0

Related Questions