Reputation: 905

Python: Putting specific lines of a file into a list

Greetings,

I ran into the following problem:

Given a file of the following structure:

'>some cookies  
chocolatejelly  
peanutbuttermacadamia  
doublecoconutapple  
'>some icecream  
cherryvanillaamaretto  
peanuthaselnuttiramisu  
bananacoffee  
'>some other stuff  
letsseewhatfancythings  
wegotinhere

Aim: collect all lines that follow each line containing '>' and put them into a list, one concatenated string per section.

Code:

def parseSequenceIntoDictionary(filename):
    lis=[]
    seq=''
    with open(filename, 'r') as fp:
        for line in fp:
            if('>' not in line):
                seq+=line.rstrip()
            elif('>' in line):
                lis.append(seq)
                seq=''
        lis.remove('')
        return lis

This function goes through the file line by line. As long as a line does not contain '>', it strips the trailing '\n' and concatenates the line onto seq. When a '>' occurs, it appends the concatenated string to the list and clears seq so that the next sequence can be collected.

The problem: with the example input file, only the entries from 'some cookies' and 'some icecream' end up in the list, but not those from 'some other stuff'. So the result is:

[chocolatejelly 
peanutbuttermacadamia 
doublecoconutapple, cherryvanillaamaretto 
peanuthaselnuttiramisu 
bananacoffee]

but not

[chocolatejelly 
peanutbuttermacadamia 
doublecoconutapple, cherryvanillaamaretto 
peanuthaselnuttiramisu 
bananacoffee, letsseewhatfancythings 
wegotinhere]

What is wrong with my reasoning here? There must be a logic mistake in the iteration that I have overlooked, but I cannot see where.

Thanks in advance for any hints!

Upvotes: 0

Answers (5)

eyquem

Reputation: 27585

import re

def parseSequenceIntoDictionary(filename,regx = re.compile('^.*>.*$',re.M)):
    with open(filename) as f:
        for el in regx.split(f.read()):
            if el:
                yield el.replace('\n','')

print(list(parseSequenceIntoDictionary('aav.txt')))

Upvotes: 0

Jochen Ritzel

Reputation: 107736

The problem is that you only store the current section seq when you hit a line with '>' in it. When the file ends, you still have that section open, but you don't store it.

The simplest way to fix your program is this:

def parseSequenceIntoDictionary(filename):
    lis=[]
    seq=''
    with open(filename, 'r') as fp:
        for line in fp:
            if('>' not in line):
                seq+=line.rstrip()
            elif('>' in line):
                lis.append(seq)
                seq=''
        # the file ended
        lis.append(seq) # store the last section
        lis.remove('')
        return lis

By the way, you should test with if line.startswith("'>"): instead, so that a '>' appearing in the middle of a data line is not mistaken for a header.
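
To illustrate the difference between the two tests, here is a small sketch. The sample lines are hypothetical and not taken from the question's file; the second one contains a '>' in the middle of the data.

```python
# Hypothetical sample lines; the second one has a '>' inside the data.
lines = ["'>some cookies", "choco>latejelly", "peanutbutter"]

# Substring test: the second line is wrongly classified as a header.
headers_in = [line for line in lines if '>' in line]

# Prefix test: only the line that starts with '> counts as a header.
headers_prefix = [line for line in lines if line.startswith("'>")]

print(headers_in)
print(headers_prefix)
```

With the substring test, a stray '>' would silently cut a section in two; the prefix test avoids that as long as headers always begin with '>.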

Upvotes: 2

snippsat

Reputation: 206

my_list = []
with open('file_in.txt') as f:
    for line in f:
        if line.startswith("'>"):
            my_list.append(line.strip().split("'>")[1])

print(my_list)  # ['some cookies', 'some icecream', 'some other stuff']

Upvotes: 1

Achim

Reputation: 15722

You only append seq to the result list if a new line with > is found. So at the end you have a filled seq (with the data you are missing), but you don't add it to the result list. So after your loop just add seq if there is some data in it and you should be fine.
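
A minimal sketch of that change, keeping the question's function name and assuming that header lines are exactly those containing '>'. Appending seq only when it holds data also makes the lis.remove('') call unnecessary.

```python
def parseSequenceIntoDictionary(filename):
    lis = []
    seq = ''
    with open(filename, 'r') as fp:
        for line in fp:
            if '>' in line:
                # Header line: store the section collected so far, if any.
                if seq:
                    lis.append(seq)
                seq = ''
            else:
                # Data line: drop trailing whitespace and concatenate.
                seq += line.rstrip()
    # The file ended: store the last section if it holds data.
    if seq:
        lis.append(seq)
    return lis
```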

Upvotes: 1

kurumi

Reputation: 25609

Well, you can simply split on '> (if I understand you correctly):

>>> s="""
... '>some cookies
... chocolatejelly
... peanutbuttermacadamia
... doublecoconutapple
... '>some icecream
... cherryvanillaamaretto
... peanuthaselnuttiramisu
... bananacoffee
... '>some other stuff
... letsseewhatfancythings
... wegotinhere  """
>>> s.split("'>")
['\n', 'some cookies  \nchocolatejelly  \npeanutbuttermacadamia  \ndoublecoconutapple  \n', 'some icecream  \ncherryvanillaamaretto  \npeanuthaselnuttiramisu  \nbananacoffee  \n', 'some other stuff  \nletsseewhatfancythings  \nwegotinhere  ']
>>>
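
To get from these chunks to the list the question asks for, each chunk still needs its header line dropped and its remaining lines joined. A possible sketch, assuming every section starts with '> and that whitespace around each line can be discarded (join_sections is a hypothetical helper name, not part of the question):

```python
def join_sections(text):
    """Split text on '> and join each section's data lines into one string."""
    result = []
    for chunk in text.split("'>"):
        lines = chunk.splitlines()
        # lines[0] is the header text (or the empty lead before the first '>); skip it.
        body = ''.join(line.strip() for line in lines[1:])
        if body:
            result.append(body)
    return result
```

Calling join_sections(s) on the string above should give one concatenated string per section, including the last one, because split does not depend on a following header to close a section.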

Upvotes: 0
