interstellar

Reputation: 399

Creating dictionary from file by defining a function

I have a file that contains this information:

#chrom    start    end    isoform
chr1    75  90  NM_100
chr1    100 120 NM_100
chr2    25  50  NM_200
chr2    55  75  NM_200
chr2    100 125 NM_200
chr2    155 200 NM_200

From this file I want to create a dictionary where the NM_'s are the keys and the starts and ends are the values. Like so:

dictionary = {'NM_100': [(75, 90), (100, 120)], 'NM_200': [(25, 50), (55, 75), (100, 125), (155, 200)]}

I've been trying to use this code to generate a function that will allow me to zip the starts and ends, but I can't seem to get it to work properly.

def read_exons(line):
    parts = iter(line.split())
    chrom = next(parts)
    start = next(parts)
    end = next(parts)
    isoform = next(parts)
    return isoform, [(s, e) for s, e in zip(start, end)]

with open('test_coding.txt') as f:
    exons = dict(read_exons(line) for line in f
        if not line.strip().startswith('#'))

I understand that the function will not allow me to append to the values, but I'm struggling to figure out how to even get the start and end for one line to appear properly in the dictionary. Any ideas? Is it a problem with the iter() or zip?

Upvotes: 1

Views: 49

Answers (1)

nneonneo

Reputation: 179392

collections.defaultdict might help:

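import collections

exons = collections.defaultdict(list)
with open('test_coding.txt') as f:
    for line in f:
        # Skip blank lines and the '#chrom ...' header, which would break int().
        if not line.strip() or line.startswith('#'):
            continue
        chrom, start, end, isoform = line.split()
        exons[isoform].append((int(start), int(end)))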

Simple!


This takes advantage of a few things:

  • It unpacks the line columns using tuple unpacking, instead of the iter() solution you have above. In general, tuple unpacking is simpler and easier to read. (The iter() calls are not what breaks your version; zip(start, end) is. start and end are strings, so zip pairs up their individual characters.)
  • It builds the dictionary incrementally, instead of trying to do it all at once as your current solution attempts (note that you can't gather all the start/end pairs at once if you are processing the data line by line!)
  • It uses collections.defaultdict to make every key map to an initially empty list, which saves you from having to check whether each key is already present. Without defaultdict, you'd write:

    exons = {}
    ...
        if isoform not in exons:
            exons[isoform] = []
        exons[isoform].append(...)
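
To check the approach against the data in the question, here is a minimal self-contained sketch. The function name group_exons and the SAMPLE_LINES list are made up for illustration (they are not part of the original code), and the rows are copied from the question so no file is needed:

```python
from collections import defaultdict

# Sample rows copied from the question (hypothetical stand-in for the file).
SAMPLE_LINES = [
    "#chrom    start    end    isoform",
    "chr1    75  90  NM_100",
    "chr1    100 120 NM_100",
    "chr2    25  50  NM_200",
    "chr2    55  75  NM_200",
    "chr2    100 125 NM_200",
    "chr2    155 200 NM_200",
]


def group_exons(lines):
    """Map each isoform to a list of (start, end) integer tuples."""
    exons = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Ignore blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        chrom, start, end, isoform = line.split()
        exons[isoform].append((int(start), int(end)))
    # Convert to a plain dict so missing keys raise KeyError again.
    return dict(exons)


exons = group_exons(SAMPLE_LINES)
print(exons)
```

Since a file object iterates over its lines, calling group_exons(f) inside a with open(...) block should behave the same way.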
    

Upvotes: 1
