Reputation: 399
I have a file that contains this information:
#chrom start end isoform
chr1 75 90 NM_100
chr1 100 120 NM_100
chr2 25 50 NM_200
chr2 55 75 NM_200
chr2 100 125 NM_200
chr2 155 200 NM_200
From this file I want to create a dictionary where the NM_ IDs are the keys and the (start, end) pairs are the values. Like so:
dictionary = {'NM_100': [(75, 90), (100, 120)], 'NM_200': [(25, 50), (55, 75), (100, 125), (155, 200)]}
I've been trying to use this code to generate a function that will allow me to zip the starts and ends, but I can't seem to get it to work properly.
def read_exons(line):
    parts = iter(line.split())
    chrom = next(parts)
    start = next(parts)
    end = next(parts)
    isoform = next(parts)
    return isoform, [(s, e) for s, e in zip(start, end)]

with open('test_coding.txt') as f:
    exons = dict(read_exons(line) for line in f
                 if not line.strip().startswith('#'))
I understand that the function will not allow me to append to the values, but I'm struggling to figure out how to even get the start and end for one line to appear properly in the dictionary. Any ideas? Is it a problem with the iter() or zip?
Upvotes: 1
Views: 49
Reputation: 179392
collections.defaultdict might help:
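import collections

exons = collections.defaultdict(list)
with open('test_coding.txt') as f:
    for line in f:
        # Skip the '#chrom ...' header line and any blank lines
        if not line.strip() or line.lstrip().startswith('#'):
            continue
        chrom, start, end, isoform = line.split()
        exons[isoform].append((int(start), int(end)))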
Simple!
This takes advantage of a few things:
It uses tuple unpacking to split each line into its four fields, instead of the iter() solution you have above. In general, tuple unpacking is simpler and easier to read.
It uses collections.defaultdict to effectively make every key map to an empty list (initially), which saves you from having to check if each key is mapped. Without defaultdict, you'd do:
exons = {}
...
if isoform not in exons:
    exons[isoform] = []
exons[isoform].append(...)
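As for whether iter() or zip is the culprit: it's the zip. start and end are strings at that point, so zip(start, end) pairs them up character by character instead of giving you one (start, end) tuple. Here is a minimal, self-contained sketch that shows this and then builds the dictionary from the sample data in the question. The build_exons name and the SAMPLE string are just for illustration, and io.StringIO stands in for the real file so the snippet runs on its own.

```python
import collections
import io

# Sample data copied from the question; io.StringIO stands in for the real file.
SAMPLE = """\
#chrom start end isoform
chr1 75 90 NM_100
chr1 100 120 NM_100
chr2 25 50 NM_200
chr2 55 75 NM_200
chr2 100 125 NM_200
chr2 155 200 NM_200
"""

# start and end are strings, so zip() pairs their characters one by one.
char_pairs = list(zip("75", "90"))


def build_exons(lines):
    """Group (start, end) integer tuples by isoform (hypothetical helper)."""
    exons = collections.defaultdict(list)
    for line in lines:
        # Skip the '#chrom ...' header line and any blank lines.
        if not line.strip() or line.lstrip().startswith('#'):
            continue
        chrom, start, end, isoform = line.split()
        exons[isoform].append((int(start), int(end)))
    # Convert to a plain dict so later lookups of missing keys raise KeyError.
    return dict(exons)


exons = build_exons(io.StringIO(SAMPLE))
print(char_pairs)
print(exons)
```

Passing an open file object to build_exons instead of the StringIO should work the same way, since both yield one line per iteration.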
Upvotes: 1