Reputation: 7832
I have used dictionaries in Python before, but I am still new to Python. This time I am using a dictionary of a dictionary of a dictionary, i.e., a three-layer dict, and wanted to check before programming it.
I want to store all the data in this three-layer dict, and was wondering what would be a nice, Pythonic way to initialize it, and then read a file and write into such a data structure.
The dictionary I want is of the following type:
{'geneid':
    {'transcript_id':
        {col_name1: col_value1, col_name2: col_value2}
    }
}
The data is of this type:
geneid\ttx_id\tcolname1\tcolname2\n
hello\tNR432\t4.5\t6.7
bye\tNR439\t4.5\t6.7
Any ideas on how to do this in a good way?
Thanks!
Upvotes: 4
Views: 566
Reputation: 73
I have to do this routinely in my research code. You'll want the defaultdict class from the collections module, because it lets you add key:value pairs at any nesting level by simple assignment. I'll show you how after answering your question. This is taken directly from one of my programs; focus on the last 4 lines (that aren't comments) and trace the variables back up through the rest of the block to see what it's doing:
from astropy.io import fits  # this package handles the image data I work with
import numpy as np
import os
from collections import defaultdict

klist = ['hdr', 'F', 'Ferr', 'flag', 'lmda', 'sky', 'skyerr', 'tel', 'telerr', 'wco', 'lsf']
dtess = []

for file in os.listdir(os.getcwd()):
    if file.startswith("apVisit"):
        meff = fits.open(file, mode='readonly', ignore_missing_end=True)
        hdr = meff[0].header
        oid = str(hdr["OBJID"])              # object ID
        mjd = int(hdr["MJD5"].strip(' '))    # 5-digit observation date
        for k, v in enumerate(klist):
            if k == 0:
                # header extension works differently from the rest of the image cube;
                # it's not relevant to populating dictionaries
                dtess = dtess + [[oid, mjd, v, hdr]]
            else:
                dtess = dtess + [[oid, mjd, v, meff[k].data]]

# HDUs in order of extension no.: header, flux, flux error, flag mask,
#   wavelength, sky flux, error in sky flux, telluric flux, telluric flux errors,
#   wavelength solution coefficients, & line-spread function

dtree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for s, t, u, v in dtess:
    dtree[s][t][u].append(v)

# once you've added all the keys you want to your dictionary,
# set the default_factory attribute to None
dtree.default_factory = None
Here's the digest version.
If you haven't set default_factory to None, you can keep adding to your nested dictionary later, either by plain assignment like my_dict[key_1][key_2][...][new_key] = new_value, or by calling append() on a leaf list. You can even add additional dictionaries by these forms of assignment, as long as the ones you add aren't themselves nested.
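Here is a minimal sketch of that idea, using toy keys rather than the FITS data above:

```python
from collections import defaultdict

# three-level tree: the leaves are lists, so we can append() to them
tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

# plain assignment/append creates every intermediate level on the fly
tree['objA'][57321]['F'].append([1.0, 2.0, 3.0])
tree['objA'][57321]['F'].append([4.0, 5.0, 6.0])
tree['objB'][57400]['hdr'].append({'OBJID': 'objB'})

print(tree['objA'][57321]['F'])  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(len(tree))                 # 2
```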
* WARNING! The newly-added last line of that code snippet, where you set the default_factory attribute to None, is super important. As long as default_factory is set, merely *looking up* a missing key (in an if test, a typo'd read, anything) silently creates a new empty entry at every level, so the dictionary can keep growing behind your back and eat up your RAM until the program grinds to a halt. Setting default_factory to None makes missing keys raise KeyError instead. I learned this the hard way a while after I wrote this answer; the problem plagued me for several months before it was figured out.
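To see the difference, here's a small sketch: with default_factory still set, even a read of a misspelled key quietly inserts an empty entry; after setting it to None, the same read raises KeyError:

```python
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))
d['real_key']['x'].append(1)

# a mere lookup of a typo'd key silently creates it
_ = d['reel_key']
print(sorted(d))  # ['real_key', 'reel_key']

# after "freezing", missing keys raise KeyError instead of being created
d.default_factory = None
try:
    _ = d['another_typo']
except KeyError:
    print('KeyError raised, nothing added')
print(sorted(d))  # still ['real_key', 'reel_key']
```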
Upvotes: 2
Reputation: 7832
I was also looking for alternatives and found this other great answer on Stack Overflow:
What's the best way to initialize a dict of dicts in Python?
Basically in my case:
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
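For example, populating the question's structure with this class might look like the following (the class is repeated here so the snippet stands alone; the sample keys are from the question):

```python
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

genes = AutoVivification()
genes['hello']['NR432']['colname1'] = '4.5'
genes['hello']['NR432']['colname2'] = '6.7'

print(genes['hello']['NR432']['colname1'])  # 4.5
```

Note that, as with defaultdict, simply reading a missing key creates an empty entry for it.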
Upvotes: 2
Reputation: 365925
First, let's start with the csv module to handle parsing the lines:
import csv

with open('mydata.txt', newline='') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        print(row)
This will print:
{'geneid': 'hello', 'tx_id': 'NR432', 'colname1': '4.5', 'colname2': '6.7'}
{'geneid': 'bye', 'tx_id': 'NR439', 'colname1': '4.5', 'colname2': '6.7'}
So, now you just need to reorganize that into your preferred structure. This is almost trivial, except that the first time you see a given geneid you have to create a new empty dict for it, and likewise the first time you see a given tx_id within a geneid. You can solve that with setdefault:
import csv

genes = {}
with open('mydata.txt', newline='') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        gene = genes.setdefault(row['geneid'], {})
        transcript = gene.setdefault(row['tx_id'], {})
        transcript['colname1'] = row['colname1']
        transcript['colname2'] = row['colname2']
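If you want to try that loop without a file on disk, the same logic runs against an in-memory string via io.StringIO (sample data copied from the question):

```python
import csv
import io

data = ("geneid\ttx_id\tcolname1\tcolname2\n"
        "hello\tNR432\t4.5\t6.7\n"
        "bye\tNR439\t4.5\t6.7\n")

genes = {}
for row in csv.DictReader(io.StringIO(data), delimiter='\t'):
    gene = genes.setdefault(row['geneid'], {})
    transcript = gene.setdefault(row['tx_id'], {})
    transcript['colname1'] = row['colname1']
    transcript['colname2'] = row['colname2']

print(genes['hello']['NR432'])  # {'colname1': '4.5', 'colname2': '6.7'}
print(sorted(genes))            # ['bye', 'hello']
```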
You can make this a bit more readable with defaultdict:
import csv
from collections import defaultdict
from functools import partial

genes = defaultdict(partial(defaultdict, dict))
with open('mydata.txt', newline='') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        genes[row['geneid']][row['tx_id']]['colname1'] = row['colname1']
        genes[row['geneid']][row['tx_id']]['colname2'] = row['colname2']
The trick here is that the top-level dict is a special one that returns an empty defaultdict(dict) whenever it first sees a new key… and that inner defaultdict in turn returns an empty plain dict for each new tx_id. The only hard part is that defaultdict takes a function that returns the right kind of object, and a function that returns a defaultdict(dict) has to be written with a partial, a lambda, or an explicit function. (There are recipes on ActiveState and modules on PyPI that will give you an even more general version of this that creates new dictionaries as needed all the way down, if you want.)
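A common sketch of that fully general, arbitrary-depth version (the usual recipe rather than any particular package) is a named function that returns a defaultdict of itself:

```python
from collections import defaultdict

def tree():
    """A dict whose missing values are themselves trees, to any depth."""
    return defaultdict(tree)

genes = tree()
genes['hello']['NR432']['colname1'] = '4.5'
genes['hello']['NR432']['extra']['even']['deeper'] = True

print(genes['hello']['NR432']['colname1'])                 # 4.5
print(genes['hello']['NR432']['extra']['even']['deeper'])  # True
```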
Upvotes: 4