Dnaiel

Reputation: 7832

Initialize/Create/Populate a Dict of a Dict of a Dict in Python

I have used dictionaries in Python before, but I am still new to Python. This time I am using a dictionary of a dictionary of a dictionary, i.e., a three-layer dict, and wanted to check before programming it.

I want to store all the data in this three-layer dict, and was wondering what would be a nice Pythonic way to initialize it, and then read a file and write to such a data structure.

The dictionary I want is of the following type:

{'geneid':
    {'transcript_id':
        {col_name1: col_value1, col_name2: col_value2}
    }
}

The data is of this type:

geneid\ttx_id\tcolname1\tcolname2\n
hello\tNR432\t4.5\t6.7
bye\tNR439\t4.5\t6.7

Any ideas on how to do this in a good way?

Thanks!

Upvotes: 4

Views: 566

Answers (3)

ColorOutOfSpace

Reputation: 73

I have to do this routinely in coding for my research. You'll want to use defaultdict (a class in the standard collections module) because it lets you add key:value pairs at any level by simple assignment. I'll show you how after answering your question. The code below is taken directly from one of my programs. Focus on the last 4 lines that aren't comments, and trace the variables back up through the rest of the block to see what it's doing:

from astropy.io import fits #this package handles the image data I work with
import numpy as np
import os
from collections import defaultdict

klist = ['hdr','F','Ferr','flag','lmda','sky','skyerr','tel','telerr','wco','lsf']
dtess = []

for file in os.listdir(os.getcwd()):
    if file.startswith("apVisit"):
        meff = fits.open(file, mode='readonly', ignore_missing_end=True)
        hdr = meff[0].header
        oid = str(hdr["OBJID"]) #object ID
        mjd = int(hdr["MJD5"].strip(' ')) #5-digit observation date
        for k,v in enumerate(klist):
            if k==0:
                #header extension works differently from the rest of the image cube
                #(it's not relevant to populating dictionaries)
                dtess.append([oid,mjd,v,hdr])
            else:
                dtess.append([oid,mjd,v,meff[k].data])
#HDUs in order of extension no.: header, flux, flux error, flag mask,
# wavelength, sky flux, error in sky flux, telluric flux, telluric flux errors,
# wavelength solution coefficients, & line-spread function
dtree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for s,t,u,v in dtess:
    dtree[s][t][u].append(v)
#once you've added all the keys you want to your dictionary, 
#set default_factory attribute to None 
dtree.default_factory = None

Here's the digest version.

  1. First, for an n-level dictionary, you have to sort and dump everything into a list of (n+1)-element records of the form [key_1, key_2, ... , key_n, value].
  2. Then, to initialize the n-level dictionary, you just type "defaultdict(lambda: " (minus the quotes) n-1 times, stick "defaultdict(list)" (or some other data type) at the end, and close the parentheses.
  3. Fill the dictionary with a for loop, appending each value to the list stored under its keys. *Note: because each lowest-level entry is a list, you will probably have to type my_dict[key_1][key_2][...][key_n][0] to get the actual value rather than the list that wraps it.
  4. *Edit: When your dictionary is as big as you want to make it, set the default_factory attribute to None.
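The steps above can be sketched with the sample data from the question. This is only an illustration: the `rows` list and the `genes` name are made up here, and the records are typed in by hand rather than read from a file.

```python
from collections import defaultdict

# Step 1: a flat list of [key_1, key_2, key_3, value] records
# (hypothetical sample, copied by hand from the question's data)
rows = [
    ['hello', 'NR432', 'colname1', '4.5'],
    ['hello', 'NR432', 'colname2', '6.7'],
    ['bye', 'NR439', 'colname1', '4.5'],
    ['bye', 'NR439', 'colname2', '6.7'],
]

# Step 2: three levels of keys, so "defaultdict(lambda: " twice, then defaultdict(list)
genes = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

# Step 3: append each value under its three keys
for geneid, tx_id, col, value in rows:
    genes[geneid][tx_id][col].append(value)

# Step 4: stop the top level from creating new entries on lookup
genes.default_factory = None

# Each leaf is a list, hence the trailing [0]
print(genes['hello']['NR432']['colname2'][0])  # prints 6.7
```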

If you haven't set default_factory to None, you can add to your nested dictionary later by either typing something like my_dict[key_1][key_2][...][new_key]=new_value, or using an append() command. You can even add additional dictionaries as long as the ones you add by these forms of assignment aren't nested themselves.

* WARNING! The newly-added last line of that code snippet, where you set the default_factory attribute to None, is important. While default_factory is set, merely looking up a key that doesn't exist (a typo, or an existence check written as my_dict[key]) silently creates a new empty entry, so the dictionary can keep growing and eating up your RAM long after you think you're done adding to it. Setting default_factory to None makes missing keys raise KeyError again, like a plain dict. Note that this only affects the dictionary you set it on; the inner defaultdicts keep their own factories. I learned this the hard way a while after I wrote this answer. This problem plagued me for several months, and I don't even think I was the one to figure it out in the end because I didn't understand anything about memory allocation.
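Here is a small sketch of what setting default_factory to None changes, using a made-up two-level dictionary `d`. It only shows the lookup behaviour; it doesn't measure memory.

```python
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))
d['a']['x'].append(1)

# With a default_factory set, merely reading a missing key creates it
_ = d['typo']
print(sorted(d))  # ['a', 'typo']

# After this, a missing top-level key raises KeyError instead
d.default_factory = None
try:
    d['another_typo']
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# The inner dictionaries are separate objects and still create keys on lookup
_ = d['a']['y']
print(sorted(d['a']))  # ['x', 'y']
```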

Upvotes: 2

Dnaiel

Reputation: 7832

I was also looking for alternatives and came across this great answer, also on Stack Overflow:

What's the best way to initialize a dict of dicts in Python?

Basically in my case:

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
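A possible way to use it with the sample rows from the question is sketched below. The class is repeated so the snippet runs on its own; the `genes` name and the hand-typed values are just for illustration.

```python
class AutoVivification(dict):
    """Same class as above, repeated so this snippet runs on its own."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

genes = AutoVivification()
# Intermediate levels are created on first access
genes['hello']['NR432']['colname1'] = '4.5'
genes['hello']['NR432']['colname2'] = '6.7'
genes['bye']['NR439']['colname1'] = '4.5'

print(genes['hello']['NR432'])  # {'colname1': '4.5', 'colname2': '6.7'}
```

As with defaultdict, reading a key that doesn't exist creates it, so a mistyped lookup leaves an empty entry behind.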

Upvotes: 2

abarnert

Reputation: 365925

First, let's start with the csv module to handle parsing the lines:

import csv
with open('mydata.txt', newline='') as f:  # Python 3: text mode, newline='' for csv
    for row in csv.DictReader(f, delimiter='\t'):
        print(row)

This will print:

{'geneid': 'hello', 'tx_id': 'NR432', 'colname1': '4.5', 'colname2': '6.7'}
{'geneid': 'bye', 'tx_id': 'NR439', 'colname1': '4.5', 'colname2': '6.7'}

So, now you just need to reorganize that into your preferred structure. This is almost trivial, except that you have to deal with the fact that the first time you see a given geneid you have to create a new empty dict for it, and likewise for the first time you see a given tx_id within a geneid. You can solve that with setdefault:

import csv
genes = {}
with open('mydata.txt', newline='') as f:  # Python 3: text mode, newline='' for csv
    for row in csv.DictReader(f, delimiter='\t'):
        gene = genes.setdefault(row['geneid'], {})
        transcript = gene.setdefault(row['tx_id'], {})
        transcript['colname1'] = row['colname1']
        transcript['colname2'] = row['colname2']
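If the file has more than two value columns, one possible variation is to copy every remaining column instead of naming each one. The sketch below assumes Python 3 and uses io.StringIO with the sample rows from the question in place of a real file; `sample` and `key_cols` are names invented for the example.

```python
import csv
import io

# Stand-in for the real file: the question's sample, tab-separated
sample = io.StringIO(
    "geneid\ttx_id\tcolname1\tcolname2\n"
    "hello\tNR432\t4.5\t6.7\n"
    "bye\tNR439\t4.5\t6.7\n"
)

key_cols = ('geneid', 'tx_id')
genes = {}
for row in csv.DictReader(sample, delimiter='\t'):
    gene = genes.setdefault(row['geneid'], {})
    transcript = gene.setdefault(row['tx_id'], {})
    # copy every column that is not one of the two keys
    for name, value in row.items():
        if name not in key_cols:
            transcript[name] = value

print(genes['bye']['NR439'])  # {'colname1': '4.5', 'colname2': '6.7'}
```

Note that csv hands back every value as a string, so the numbers stay as '4.5' and '6.7' unless you convert them yourself.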

You can make this a bit more readable with defaultdict:

import csv
from collections import defaultdict
from functools import partial
genes = defaultdict(partial(defaultdict, dict))
with open('mydata.txt', newline='') as f:  # Python 3: text mode, newline='' for csv
    for row in csv.DictReader(f, delimiter='\t'):
        genes[row['geneid']][row['tx_id']]['colname1'] = row['colname1']
        genes[row['geneid']][row['tx_id']]['colname2'] = row['colname2']

The trick here is that the top-level dict is a special one that returns a new empty dict whenever it first sees a new key… and that dict it returns is itself a special one that returns an empty dict for each new key. The only hard part is that defaultdict takes a function that returns the right kind of object, and a function that returns a defaultdict(dict) has to be written with a partial, a lambda, or an explicit function. (There are recipes on ActiveState and modules on PyPI that will give you an even more general version of this that creates new dictionaries as needed all the way down, if you want.)
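To make the three spellings concrete, here is a rough sketch of the partial, lambda, and explicit-function forms side by side, followed by one common way to get the "all the way down" behaviour. The helper names `make_inner` and `tree` are made up for this example and are not taken from any particular recipe or module.

```python
from collections import defaultdict
from functools import partial

# Three equivalent spellings of "a factory that returns defaultdict(dict)"
with_partial = defaultdict(partial(defaultdict, dict))
with_lambda = defaultdict(lambda: defaultdict(dict))

def make_inner():
    return defaultdict(dict)

with_function = defaultdict(make_inner)

for genes in (with_partial, with_lambda, with_function):
    genes['hello']['NR432']['colname1'] = '4.5'

# "All the way down": every missing key yields another tree (hypothetical helper)
def tree():
    return defaultdict(tree)

deep = tree()
deep['hello']['NR432']['colname1'] = '4.5'
```

The recursive version never raises KeyError at any depth, which is convenient while filling the structure but hides typos when reading from it.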

Upvotes: 4
