Tim Stack

Reputation: 3248

Populating a dictionary with multiple lines as one string

I have a file with multiple lines in FASTA format, which I want to break up into pieces and use to populate a dictionary.

>piece_1 
Lorem ipsum dolor sit amet
consectetur adipiscing elit. Nam a pellentesque mi. 
>piece_2 
Integer dignissim ultrices eros a consequat. Praesent vestibulum
>piece_3 
Morbi eget sollicitudin mauris. Nunc varius felis 
vitae dui congue hendrerit. Nam semper venenatis auctor.  
Suspendisse potenti. Suspendisse facilisis velit vel convallis 
fringilla. Duis condimentum auctor mauris eu lobortis. 

From the text above, I want to create a dictionary that contains each separate piece of text as one string, with the keys being >piece_1 etc.

So far I have managed to populate a dictionary with all the keys, but I can't work out how to extract the corresponding text from the file.

f = open('Output.txt', 'r')
mydict = dict()

for index, line in enumerate(f):
    if line[:1]=='>':
        mydict[index] = line #instead, the key should be line with the value being the relative text.
        print(line, end='')

Upvotes: 1

Views: 2126

Answers (4)

pktl2k

Reputation: 650

Here's another compact possibility using list and dict comprehensions:

with open('Output.txt', 'r') as f:
    s = f.read()
result = {k.strip(): v for k, v in [part.split('\n', maxsplit=1)
                                    for part in s.split('>')[1:]] }

In the inner list comprehension, the first element that s.split('>') returns is an empty string (the file starts with '>'), so it is skipped with [1:]. maxsplit=1 in the subsequent split at \n keeps each part from being split into more than two pieces: the header line and the remaining text.
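To make the two split steps easier to follow, here is a minimal sketch that runs the same idea on an in-memory string instead of a file. The sample text is a shortened stand-in for the question's data, and the names sample, parts and pairs are only illustrative.

```python
# Shortened stand-in for the file contents (an assumption for illustration).
sample = (
    ">piece_1 \n"
    "Lorem ipsum dolor sit amet\n"
    "consectetur adipiscing elit.\n"
    ">piece_2 \n"
    "Integer dignissim ultrices eros\n"
)

# The text starts with '>', so the first element of the split is an empty string.
parts = sample.split(">")
print(repr(parts[0]))

# Each remaining part looks like "header\nbody"; maxsplit=1 keeps the body whole.
pairs = [part.split("\n", maxsplit=1) for part in parts[1:]]

# strip() removes the trailing space from each header before using it as a key.
result = {header.strip(): body for header, body in pairs}
print(result)
```

Note that the keys come out without the leading '>' and that each value keeps its internal and trailing newlines. A header with no text after it and no trailing newline would not unpack into two values, so this sketch assumes every header is followed by at least one line.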

Upvotes: 0

Rakesh

Reputation: 82785

Here is one approach using simple iteration.

Example:

result = []
with open('Output.txt') as infile:
    for line in infile:
        if line.startswith(">"):             # Header line: start a new entry [key, [lines of text]]
            result.append([line, []])
        else:
            result[-1][1].append(line)       # Append the text to the most recently found key

mydict = {k: "".join(v) for k, v in result}  # Join each list of lines into one string
print(mydict)

Output:

{'>piece_1 \n': 'Lorem ipsum dolor sit amet\nconsectetur adipiscing elit. Nam a pellentesque mi. \n',
 '>piece_2 \n': 'Integer dignissim ultrices eros a consequat. Praesent vestibulum\n',
 '>piece_3 \n': 'Morbi eget sollicitudin mauris. Nunc varius felis \nvitae dui congue hendrerit. Nam semper venenatis auctor.  \nSuspendisse potenti. Suspendisse facilisis velit vel convallis \nfringilla. Duis condimentum auctor mauris eu lobortis. '}

Upvotes: 1

Chris_Rands

Reputation: 41188

I suggest using Biopython; it will be more robust and concise than writing your own solution:

>>> from Bio import SeqIO
>>> d = SeqIO.to_dict(SeqIO.parse('input.fa', 'fasta'))

For your data:

>>> d['piece_1']
SeqRecord(seq=Seq('Loremipsumdolorsitametconsecteturadipiscingelit.Namape...mi.', SingleLetterAlphabet()), id='piece_1', name='piece_1', description='piece_1', dbxrefs=[])
>>> str(d['piece_1'].seq)
'Loremipsumdolorsitametconsecteturadipiscingelit.Namapellentesquemi.'

Upvotes: 3

Maarten Fabré

Reputation: 7058

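You could use a collections.defaultdict:

from collections import defaultdict

result = defaultdict(list)
index = None
with open('Output.txt') as f:
    for line in f:
        line = line.rstrip('\n')        # drop the newline, keep other whitespace
        if line.startswith(">"):
            index = line[1:]            # header without the leading '>'
        else:
            result[index].append(line)  # collect text under the current header

For the sample data, result then holds (pretty-printed):
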
{
    "piece_1 ": [
        "Lorem ipsum dolor sit amet",
        "consectetur adipiscing elit. Nam a pellentesque mi. ",
    ],
    "piece_2 ": [
        "Integer dignissim ultrices eros a consequat. Praesent vestibulum"
    ],
    "piece_3 ": [
        "Morbi eget sollicitudin mauris. Nunc varius felis ",
        "vitae dui congue hendrerit. Nam semper venenatis auctor.  ",
        "Suspendisse potenti. Suspendisse facilisis velit vel convallis ",
        "fringilla. Duis condimentum auctor mauris eu lobortis.",
    ],
}
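
Since the question asks for one string per key rather than a list of lines, the lists can be joined afterwards. This is a minimal sketch on a shortened in-memory sample; the names lines and joined are illustrative, and joining with "\n" (rather than "" or " ") is an assumption about how the pieces should be glued together.

```python
from collections import defaultdict

# Shortened stand-in for the file's lines, newlines already removed (an assumption).
lines = [
    ">piece_1 ",
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit.",
    ">piece_2 ",
    "Integer dignissim ultrices eros",
]

result = defaultdict(list)
index = None
for line in lines:
    if line.startswith(">"):
        index = line[1:].strip()    # header without '>' and surrounding whitespace
    else:
        result[index].append(line)  # collect text under the current header

# Collapse each list of lines into a single string per key.
joined = {key: "\n".join(chunks) for key, chunks in result.items()}
print(joined)
```

Any text that appears before the first header would end up under the key None, so it may be worth checking for that key if the input is not guaranteed to start with '>'.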

Upvotes: 1
