Reputation: 3248
I have a multi-line file in FASTA format that I want to break up into pieces, and then populate a dictionary with those pieces.
>piece_1
Lorem ipsum dolor sit amet
consectetur adipiscing elit. Nam a pellentesque mi.
>piece_2
Integer dignissim ultrices eros a consequat. Praesent vestibulum
>piece_3
Morbi eget sollicitudin mauris. Nunc varius felis
vitae dui congue hendrerit. Nam semper venenatis auctor.
Suspendisse potenti. Suspendisse facilisis velit vel convallis
fringilla. Duis condimentum auctor mauris eu lobortis.
I want to create, from the text above, a dictionary that contains each separate piece of text, with the keys being >piece_1, >piece_2, etc.
So far I managed to populate a dictionary with all keys, but I can't tell how to extract the texts from the file.
    f = open('Output.txt', 'r')
    mydict = dict()
    for index, line in enumerate(f):
        if line[:1]=='>':
            mydict[index] = line #instead, the key should be line with the value being the relative text.
            print(line, end='')
Upvotes: 1
Reputation: 650
Here's another compact possibility using list and dict comprehensions:
    with open('Output.txt', 'r') as f:
        s = f.read()
    result = {k.strip(): v for k, v in [part.split('\n', maxsplit=1)
                                        for part in s.split('>')[1:]]}
In the inner list comprehension, the 0th element of the list that s.split('>') returns is an empty string (the text before the first '>'), so we skip it with [1:]. maxsplit=1 in the subsequent split at \n prevents the text from being split into more than 2 pieces: the header line and everything after it.
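To see what each split returns, here is a minimal sketch that runs the same comprehension on an in-memory string instead of Output.txt. The sample text and the names `sample` and `parts` are made up for illustration only.

```python
# Illustrative sample; in the answer above, `s` comes from f.read().
sample = ">piece_1\nLorem ipsum\ndolor sit amet\n>piece_2\nInteger dignissim\n"

# Splitting on '>' leaves an empty string before the first header.
parts = sample.split('>')
print(repr(parts[0]))

# maxsplit=1 separates only the header line from the rest of each record,
# so newlines inside the record body are left untouched.
result = {k.strip(): v for k, v in [part.split('\n', maxsplit=1)
                                    for part in parts[1:]]}
print(result)
```

Note that a header with no newline after it (for example a bare header as the very last line of the file) would make the two-name unpacking fail, so this form assumes every header is followed by at least a line break.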
Upvotes: 0
Reputation: 82785
This is one approach using a simple iteration.
Ex:
    result = []
    with open(filename) as infile:
        for line in infile:
            if line.startswith(">"):  #Check if line starts with '>'
                result.append([line, []])  #Create new list with format --> [key, [list of corresponding text]]
            else:
                result[-1][1].append(line)  #Append text to previously found key.
    mydict = {k: "".join(v) for k, v in result}  #Form required dictionary.
    print(mydict)
Output:
{'>piece_1 \n': 'Lorem ipsum dolor sit amet\nconsectetur adipiscing elit. Nam a pellentesque mi. \n',
'>piece_2 \n': 'Integer dignissim ultrices eros a consequat. Praesent vestibulum\n',
'>piece_3 \n': 'Morbi eget sollicitudin mauris. Nunc varius felis \nvitae dui congue hendrerit. Nam semper venenatis auctor. \nSuspendisse potenti. Suspendisse facilisis velit vel convallis \nfringilla. Duis condimentum auctor mauris eu lobortis. '}
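The keys in the output above keep the leading '>' and the trailing whitespace of each header line. If cleaner keys are wanted, one possible variation is sketched below. The helper name `read_records` and the sample lines are hypothetical, and skipping text that appears before the first header is an assumption about the desired behavior, not something the loop above does.

```python
def read_records(lines):
    """Group lines under the most recent '>' header (hypothetical helper)."""
    records = {}
    key = None
    for line in lines:
        if line.startswith(">"):
            key = line[1:].strip()  # drop '>' and surrounding whitespace
            records[key] = []
        elif key is not None:       # assumption: ignore text before the first header
            records[key].append(line)
    # Join the collected lines of each record into one string.
    return {k: "".join(v) for k, v in records.items()}


# Any iterable of lines works here, including an open file object.
sample = [
    ">piece_1 \n",
    "Lorem ipsum dolor sit amet\n",
    "consectetur adipiscing elit.\n",
    ">piece_2 \n",
    "Integer dignissim\n",
]
mydict = read_records(sample)
print(mydict)
```

Because the function only iterates over its argument, it can be called as `read_records(infile)` inside a `with open(...)` block as well.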
Upvotes: 1
Reputation: 41188
I suggest using Biopython; it will be more robust and concise than writing your own solution:
>>> from Bio import SeqIO
>>> d = SeqIO.to_dict(SeqIO.parse('input.fa', 'fasta'))
For your data:
>>> d['piece_1']
SeqRecord(seq=Seq('Loremipsumdolorsitametconsecteturadipiscingelit.Namape...mi.', SingleLetterAlphabet()), id='piece_1', name='piece_1', description='piece_1', dbxrefs=[])
>>> str(d['piece_1'].seq)
'Loremipsumdolorsitametconsecteturadipiscingelit.Namapellentesquemi.'
Upvotes: 3
Reputation: 7058
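You could use a collections.defaultdict:
    from collections import defaultdict
    # text is assumed to be the lines of the file without line endings,
    # e.g. text = f.read().splitlines()
    result = defaultdict(list)
    index = None
    for line in text:
        if line.startswith(">"):
            index = line[1:]  # header without the leading '>'
        else:
            result[index].append(line)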
    {
        "piece_1 ": [
            "Lorem ipsum dolor sit amet",
            "consectetur adipiscing elit. Nam a pellentesque mi. ",
        ],
        "piece_2 ": [
            "Integer dignissim ultrices eros a consequat. Praesent vestibulum"
        ],
        "piece_3 ": [
            "Morbi eget sollicitudin mauris. Nunc varius felis ",
            "vitae dui congue hendrerit. Nam semper venenatis auctor. ",
            "Suspendisse potenti. Suspendisse facilisis velit vel convallis ",
            "fringilla. Duis condimentum auctor mauris eu lobortis.",
        ],
    }
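The result above maps each key to a list of lines. If a single string per key is preferred, as the question asks, the lists can be joined afterwards. This is a small sketch under assumptions: the sample text is shortened, the keys are stripped of trailing whitespace, and lines before the first header are skipped, none of which the original answer does.

```python
from collections import defaultdict

# Shortened illustrative input; splitlines() removes the line endings.
text = """>piece_1
Lorem ipsum dolor sit amet
consectetur adipiscing elit.
>piece_2
Integer dignissim""".splitlines()

result = defaultdict(list)
index = None
for line in text:
    if line.startswith(">"):
        index = line[1:].strip()   # assumption: strip whitespace from the key
    elif index is not None:        # assumption: skip lines before the first header
        result[index].append(line)

# Join each list of lines back into one block of text.
joined = {key: "\n".join(lines) for key, lines in result.items()}
print(joined)
```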
Upvotes: 1