Reputation: 905
Greetings,
I ran into the following problem:
Given a file of the following structure:
'>some cookies
chocolatejelly
peanutbuttermacadamia
doublecoconutapple
'>some icecream
cherryvanillaamaretto
peanuthaselnuttiramisu
bananacoffee
'>some other stuff
letsseewhatfancythings
wegotinhere
Aim: put all lines that follow each line containing '>' into a list, as one concatenated string per section.
Code:
def parseSequenceIntoDictionary(filename):
    lis=[]
    seq=''
    with open(filename, 'r') as fp:
        for line in fp:
            if('>' not in line):
                seq+=line.rstrip()
            elif('>' in line):
                lis.append(seq)
                seq=''
    lis.remove('')
    return lis
So this function goes through each line of the file. If a line does not contain '>', it strips the trailing '\n' and concatenates the line onto seq. If a line does contain '>', it appends the concatenated string to the list and clears seq so that the next sequence can be collected.
The problem: with the example input file above, it only puts the content of 'some cookies' and 'some icecream' into the list, but not the content of 'some other stuff'. So the result is:
['chocolatejellypeanutbuttermacadamiadoublecoconutapple', 'cherryvanillaamarettopeanuthaselnuttiramisubananacoffee']
but not
['chocolatejellypeanutbuttermacadamiadoublecoconutapple', 'cherryvanillaamarettopeanuthaselnuttiramisubananacoffee', 'letsseewhatfancythingswegotinhere']
What is wrong with my reasoning here? There must be a logic mistake in the iteration that I have overlooked, but I cannot see where.
Thanks in advance for any hints!
Upvotes: 0
Views: 4688
Reputation: 27585
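import re

def parseSequenceIntoDictionary(filename, regx=re.compile('^.*>.*$', re.M)):
    # Split the whole file content on header lines (any line containing '>').
    with open(filename) as f:
        for el in regx.split(f.read()):
            if el:
                # Join the lines of one section into a single string.
                yield el.replace('\n', '')

print(list(parseSequenceIntoDictionary('aav.txt')))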
Upvotes: 0
Reputation: 107736
The problem is that you only store the current section seq when you hit a line with '>' in it. When the file ends, you still have that section open, but you don't store it.
The simplest way to fix your program is this:
def parseSequenceIntoDictionary(filename):
    lis=[]
    seq=''
    with open(filename, 'r') as fp:
        for line in fp:
            if('>' not in line):
                seq+=line.rstrip()
            elif('>' in line):
                lis.append(seq)
                seq=''
        # the file ended
        lis.append(seq) # store the last section
    lis.remove('')
    return lis
By the way, you should use if line.startswith("'>"): so that a data line that happens to contain '>' is not mistaken for a header, which prevents a possible bug.
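To make that possible bug concrete, here is a minimal sketch. It assumes a made-up data line ("choco>late") that contains '>' in the middle; the sample lines and the two helper names are hypothetical and only serve to compare the two checks.

```python
# Hypothetical sample: the second line is data, but it contains '>'.
lines = ["'>some cookies\n", "choco>late\n", "jelly\n"]

def is_header_loose(line):
    # The check used in the question: true for any line containing '>'.
    return '>' in line

def is_header_strict(line):
    # Only true when the line begins with the header marker "'>".
    return line.startswith("'>")

loose = [is_header_loose(line) for line in lines]
strict = [is_header_strict(line) for line in lines]
print(loose)
print(strict)
```

With the loose check, the data line would be treated as a header and dropped from the sequence; the strict check only matches the real header.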
Upvotes: 2
Reputation: 206
my_list = []
with open('file_in.txt') as f:
    for line in f:
        if line.startswith("'>"):
            # Keep only the text after the "'>" marker.
            my_list.append(line.strip().split("'>")[1])
print(my_list)  # ['some cookies', 'some icecream', 'some other stuff']
Upvotes: 1
Reputation: 15722
You only append seq to the result list if a new line with > is found. So at the end you have a filled seq (with the data you are missing), but you don't add it to the result list. So after your loop just add seq if there is some data in it and you should be fine.
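A minimal sketch of that idea, assuming the headers start with "'>" as in the question. The function name parse_sections and the in-memory sample list are hypothetical; the function accepts any iterable of lines, so an open file object should work the same way.

```python
def parse_sections(lines):
    """Return one concatenated string per section of data lines."""
    result = []
    seq = ''
    for line in lines:
        if line.startswith("'>"):
            # A new header: store the previous section if it has data.
            if seq:
                result.append(seq)
            seq = ''
        else:
            seq += line.rstrip()
    # After the loop: store the last section if it has data.
    if seq:
        result.append(seq)
    return result

# Hypothetical in-memory stand-in for the file from the question.
sample = [
    "'>some cookies\n", "chocolatejelly\n", "peanutbuttermacadamia\n",
    "doublecoconutapple\n",
    "'>some icecream\n", "cherryvanillaamaretto\n",
    "peanuthaselnuttiramisu\n", "bananacoffee\n",
    "'>some other stuff\n", "letsseewhatfancythings\n", "wegotinhere\n",
]
sections = parse_sections(sample)
print(sections)
```

Because seq is only appended when it is non-empty, no empty string ends up in the list and the lis.remove('') call is not needed. Note that a header with no data lines is silently skipped in this sketch.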
Upvotes: 1
Reputation: 25609
Well, you can simply split on '> (if I understand you correctly):
>>> s="""
... '>some cookies
... chocolatejelly
... peanutbuttermacadamia
... doublecoconutapple
... '>some icecream
... cherryvanillaamaretto
... peanuthaselnuttiramisu
... bananacoffee
... '>some other stuff
... letsseewhatfancythings
... wegotinhere """
>>> s.split("'>")
['\n', 'some cookies \nchocolatejelly \npeanutbuttermacadamia \ndoublecoconutapple \n', 'some icecream \ncherryvanillaamaretto \npeanuthaselnuttiramisu \nbananacoffee \n', 'some other stuff \nletsseewhatfancythings \nwegotinhere ']
>>>
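The split result still contains the header text and the newlines, so it needs one more step to get the list the question asks for. A minimal sketch of that step, assuming every chunk starts with its header line; the function name sections_from_text is hypothetical.

```python
s = """'>some cookies
chocolatejelly
peanutbuttermacadamia
doublecoconutapple
'>some icecream
cherryvanillaamaretto
peanuthaselnuttiramisu
bananacoffee
'>some other stuff
letsseewhatfancythings
wegotinhere"""

def sections_from_text(text):
    result = []
    for chunk in text.split("'>"):
        lines = chunk.splitlines()
        # lines[0] is the header text (e.g. "some cookies"); the rest is data.
        body = ''.join(line.strip() for line in lines[1:])
        if body:
            result.append(body)
    return result

joined = sections_from_text(s)
print(joined)
```

Keep in mind that this splits on "'>" anywhere in the text, not only at the start of a line, and that it reads the whole file content into memory first.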
Upvotes: 0