Reputation: 57
How can I remove ids like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n'
from sequences?
I have this code:
with open('sequence.fasta', 'r') as f :
while True:
line1=f.readline()
line2=f.readline()
line3=f.readline()
if not line3:
break
fct([line1[i:i+100] for i in range(0, len(line1), 100)])
fct([line2[i:i+100] for i in range(0, len(line2), 100)])
fct([line3[i:i+100] for i in range(0, len(line3), 100)])
Output:
['>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n']
['CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n']
['AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n']
['CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA\n']
['AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA\n']
['ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT\n']
['AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA\n']
['GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC\n']
['AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT\n']
['TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT\n']
['GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT\n']
['GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC\n']
['\n']
...
My function is:
def fct(input_string):
code={"a":0,"c":1,"g":2,"t":3}
p=[code[i] for i in input_string]
n=len(input_string)
c=0
for i, n in enumerate(range(n, 0, -1)):
c +=p[i]*(4**(n-1))
return c+1
fct()
returns an integer from a string. For example, ACT
gives 8
i.e.: my function must take as input string sequences contain just the following bases A,C,G,T
But when I use my function it gives:
KeyError: '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n'
I try to remove ids by stripping lines start with >
and writing the rest in text file so, my text file output.txt
contains just sequences without ids, but when I use my function fct I found the same error:
KeyError: 'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n'
What can I do?
Upvotes: 1
Views: 1123
Reputation: 17333
I see two major problems in your code: You're having problems parsing FASTA sequences, and your function is not properly iterating over each sequence.
Might I suggest using the excellent Biopython package? It has excellent FASTA support (reading and writing) built in (see Sequences in the Tutorial).
To parse sequences from a FASTA file:
for seq_record in SeqIO.parse("seqs.fasta", "fasta"):
print record.description # gi|2765658|emb|Z78533.1...
print record.seq # a Seq object, call str() to get a simple string
>>> print record.id
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> print record.description
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> print record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
>>> print str(record.seq)
'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACC' #(truncated)
In your code, you have a list of strings being passed to fct()
(input_string
is not actually a string, but a list of strings). The solution is just to build one input string, and iterate over that.
fct
:c
is returned immediately.p
when you can just index into code
when iterating over the sequence?n
) by using it in your for
loop as a variable nameModified code (with proper PEP 8 formatting), and variables renamed to be clearer what they mean (still have no idea what c
is supposed to be):
from Bio import SeqIO
def dna_seq_score(dna_seq):
nucleotide_code = {"A": 0, "C": 1, "G": 2, "T": 3}
c = 0
for i, k in enumerate(range(len(dna_seq), 0, -1)):
nucleotide = dna_seq[i]
code_num = nucleotide_code[nucleotide]
c += code_num * (4 ** (k - 1))
return c + 1
for record in SeqIO.parse("test.fasta", "fasta"):
dna_seq_score(record.seq)
Upvotes: 4