Achal Neupane
Achal Neupane

Reputation: 5729

Sort the order of fasta sequence using python

I have a fasta file (consists of >header and sequence lines) as below:

myfasta

>S.sclerotiorum_Ch16_153_209
AACCCTAACCCTAACCCTTGATTGATTGATTGATTGATTGAT
TGATTGATGAAATTATAGTCTCCGTAAAGCAAATAAAGCATT
TAGTAAACGTTGAAGAGCTAGAAAAGCTTTAATACAAAAAGG
>S.sclerotiorum_Ch16_153_209
AACCCTAACCCTAACCCTTGATTGATTGATTGATTGATTGAT
TAGTAAACGTTGAAGAGCTAGAAAAGCTTTAATACAAAAAGG
>S.sclerotiorum_Ch14_442_1137
TGTCAATTCGATCTAGTATT
>S.sclerotiorum_Ch12_1831_180
AGAGCTAGAAAAGCTTTAAT
>S.sclerotiorum_Ch1_1831_180
AGAGCTAGAAAAGCTTTAATAGAGCTAGAAAAGCTTTAAT
AGAGCTAGAAAAGCTTTAATAGAGCTAGAAAAGCTTTAAT

I want to print this file in a defined order (starting with Ch1 to Ch16) and get the result as below:

>S.sclerotiorum_Ch1_1831_180
AGAGCTAGAAAAGCTTTAATAGAGCTAGAAAAGCTTTAAT
AGAGCTAGAAAAGCTTTAATAGAGCTAGAAAAGCTTTAAT
>S.sclerotiorum_Ch12_1831_180
AGAGCTAGAAAAGCTTTAAT
>S.sclerotiorum_Ch14_442_1137
TGTCAATTCGATCTAGTATT
>S.sclerotiorum_Ch16_153_209
AACCCTAACCCTAACCCTTGATTGATTGATTGATTGATTGAT
TAGTAAACGTTGAAGAGCTAGAAAAGCTTTAATACAAAAAGG
>S.sclerotiorum_Ch16_153_209
AACCCTAACCCTAACCCTTGATTGATTGATTGATTGATTGAT
TGATTGATGAAATTATAGTCTCCGTAAAGCAAATAAAGCATT
TAGTAAACGTTGAAGAGCTAGAAAAGCTTTAATACAAAAAGG

I tried to write this code, but I am still getting the same order as my input file in my result.fasta. Any help would be appreciated in fixing my code. Thanks!

Code: python code.py myfasta.fasta >> result.fasta

#!/usr/bin/env python
import sys
import os
import pathlib

myfasta = sys.argv[1]
fasta = open(myfasta)

#types = ['S.sclerotiorum_Ch16_', 'S.sclerotiorum_Ch15_', 'S.sclerotiorum_Ch14_', 'S.sclerotiorum_Ch13_', 'S.sclerotiorum_Ch12_', 'S.sclerotiorum_Ch11_', 'S.sclerotiorum_Ch10_', 'S.sclerotiorum_Ch9_', 'S.sclerotiorum_Ch8_', 'S.sclerotiorum_Ch7_', 'S.sclerotiorum_Ch6_', 'S.sclerotiorum_Ch5_', 'S.sclerotiorum_Ch4_', 'S.sclerotiorum_Ch3_', 'S.sclerotiorum_Ch2_', 'S.sclerotiorum_Ch1_']

types = ['S.sclerotiorum_Ch1', 'S.sclerotiorum_Ch2', 'S.sclerotiorum_Ch3', 'S.sclerotiorum_Ch4', 'S.sclerotiorum_Ch5', 'S.sclerotiorum_Ch6', 'S.sclerotiorum_Ch7', 'S.sclerotiorum_Ch8', 'S.sclerotiorum_Ch9', 'S.sclerotiorum_Ch10', 'S.sclerotiorum_Ch11', 'S.sclerotiorum_Ch12', 'S.sclerotiorum_Ch13', 'S.sclerotiorum_Ch14', 'S.sclerotiorum_Ch15', 'S.sclerotiorum_Ch16']




for type in range(len(types)):
    flag = False
    fasta = open(myfasta)
    for line in fasta:
        if line.startswith('>') and types[type] in line:
            flag = True
        elif line.startswith('>'):
          flag = False
        if flag:
            #grabbed = line.strip()
            #newfasta.writelines(grabbed + "\n")
            print(line.strip())

fasta.close

Upvotes: 1

Views: 793

Answers (1)

blhsing
blhsing

Reputation: 107134

You can use re.findall with a pattern that matches a line followed by lines that start with a non-> character and also groups the number after Ch, use sorted to sort the matches according to the number, and use str.join to join the sorted substrings back to a string:

import re
''.join(s for s, _ in sorted(re.findall(r'(.*_Ch(\d+)_.*\n(?:[^>].*\n)*)', f), key=lambda t: int(t[1])))

Given your input string stored in f (you should read the entire file into this variable first), this returns:

>S.sclerotiorum_Ch1_1831_180
AGAGCTAGAAAAGCTTTAATAGAGCTAGAAAAGCTTTAAT
>S.sclerotiorum_Ch12_1831_180
AGAGCTAGAAAAGCTTTAAT
>S.sclerotiorum_Ch14_442_1137
TGTCAATTCGATCTAGTATT
>S.sclerotiorum_Ch16_153_209
AACCCTAACCCTAACCCTTGATTGATTGATTGATTGATTGAT
TGATTGATGAAATTATAGTCTCCGTAAAGCAAATAAAGCATT
TAGTAAACGTTGAAGAGCTAGAAAAGCTTTAATACAAAAAGG
>S.sclerotiorum_Ch16_153_209
AACCCTAACCCTAACCCTTGATTGATTGATTGATTGATTGAT
TAGTAAACGTTGAAGAGCTAGAAAAGCTTTAATACAAAAAGG

Upvotes: 1

Related Questions