Python defaultdict deep nested data structure

Question

I have a list of Excel datasets, each containing information like the following:

Category    Subcategory    Name
Main Dish   Noodle         Tomato Noodle
Main Dish   Stir Fry       Chicken Rice
Main Dish   Soup           Beef Goulash
Drink       Wine           Bordeaux
Drink       Softdrink      Cola

Suppose the dataset above is just one of several. My desired data structure, using nested dicts and lists, is:

data = {0:{'data':0, 'Category':[
            {'name':'Main Dish', 'Subcategory':[
                {'name':'Noodle', 'key':0, 'data':[{'key':1, 'title':'Tomato Noodle'}]},
                {'name':'Stir Fry', 'key':1, 'data':[{'key':2, 'title':'Chicken Rice'}]},
                {'name':'Soup', 'key':2, 'data':[{'key':3, 'title':'Beef Goulash'}]}]},
            {'name':'Drink', 'Subcategory':[
                {'name':'Wine', 'key':0, 'data':[{'key':1, 'title':'Bordeaux'}]},
                {'name':'Softdrink', 'key':1, 'data':[{'key':2, 'title':'Cola'}]}]}]},
        1:{'data':1, 'Category':[...]}}  # same structure as dataset 0

So basically, the whole category is a defaultdict(list), and each distinct category is a dict within that category list. The subcategories work the same way, nested under their category.

I tried to do this with defaultdict; here is my code:

from collections import defaultdict
data = defaultdict(dict)
cateList = ["Main Dish", "Drink"]
n = 3 # n means the number of datasets
for i in range(n):
    data[i]['data'] = i
    data[i]['category'] = defaultdict(list) 
    for j in range(len(cateList)):
        data[i]['category'][j]['name'] = cateList[j]
        data[i]['category'][j]['subcategory'] = defaultdict(list)
data

But I receive the following error:

TypeError                                 Traceback (most recent call last)
 in ()
      5     data[i]['category'] = defaultdict(list)
      6     for j in range(len(cateList)):
----> 7         data[i]['category'][j]['name'] = cateList[j]
      8         data[i]['category'][j]['subcategory'] = defaultdict(list)
      9 data

TypeError: list indices must be integers or slices, not str

This was run in a Jupyter Notebook. It seems I can't assign into the nested defaultdict this way: data[i]['category'][j]['name'] = cateList[j]. I am not sure how to construct the data structure above. Is there a better way?

Thank you very much for your help.

Martijn Pieters · Accepted Answer

Your spec states you wanted 'Category' to reference a list:

data = {0:{'data':0, 'Category':[
#                               ^ a list opening bracket

but instead, your code makes it a dictionary:

data[i]['category'] = defaultdict(list)

but the remainder of your code then attempts to treat the 'category' object as a list again, by using j as an index. Because it is a defaultdict(list) instead, the expression data[i]['category'][j] creates and returns an empty list for the missing key j, and data[i]['category'][j]['name'] or data[i]['category'][j]['subcategory'] then tries to index that list with a string, which raises the TypeError.
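
The failure can be reproduced in isolation. A minimal sketch, using a standalone category variable and the key 0 in place of data[i]['category'] and j (both stand-ins chosen for illustration):

```python
from collections import defaultdict

category = defaultdict(list)

# Looking up a missing key calls list() and stores the result,
# so category[0] is an empty list, not a dictionary.
first = category[0]
print(type(first).__name__)

try:
    first['name'] = 'Main Dish'  # indexing a list with a string
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second print shows the same message as the traceback in the question.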

Building this structure really doesn't require a defaultdict; you already know you want to build data, and you are building the nested structures right there with loops. You can just use regular dictionaries and lists:

cateList = ["Main Dish", "Drink"]
n = 3 # n means the number of datasets

data = {}
for i in range(n):
    data[i] = {
        'data': i,
        'category': []
    }
    category = data[i]['category']
    for name in cateList:
        category.append({
            'name': name,
            'subcategory': []
        })
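
The same plain-dict approach can be extended to fill in the subcategories from the spreadsheet rows. This is only a rough sketch: it assumes the rows have already been read into (category, subcategory, name) tuples, the rows variable and the build_categories helper are hypothetical names, and the key numbering (subcategory keys from 0, item keys from 1 within each category) is inferred from the desired output in the question. It uses the capitalised 'Category' and 'Subcategory' keys from that desired output.

```python
rows = [
    ("Main Dish", "Noodle", "Tomato Noodle"),
    ("Main Dish", "Stir Fry", "Chicken Rice"),
    ("Main Dish", "Soup", "Beef Goulash"),
    ("Drink", "Wine", "Bordeaux"),
    ("Drink", "Softdrink", "Cola"),
]


def build_categories(rows):
    """Group (category, subcategory, name) rows into nested lists of dicts."""
    categories = []      # result list, in first-seen order
    by_category = {}     # category name -> its dict in `categories`
    by_subcategory = {}  # (category, subcategory) -> its dict
    item_counts = {}     # category name -> number of items added so far

    for cat_name, sub_name, title in rows:
        if cat_name not in by_category:
            by_category[cat_name] = {'name': cat_name, 'Subcategory': []}
            categories.append(by_category[cat_name])
            item_counts[cat_name] = 0
        subcategories = by_category[cat_name]['Subcategory']

        if (cat_name, sub_name) not in by_subcategory:
            sub = {'name': sub_name, 'key': len(subcategories), 'data': []}
            by_subcategory[cat_name, sub_name] = sub
            subcategories.append(sub)

        # Assumption: item keys count up from 1 within each category.
        item_counts[cat_name] += 1
        by_subcategory[cat_name, sub_name]['data'].append(
            {'key': item_counts[cat_name], 'title': title})

    return categories


data = {0: {'data': 0, 'Category': build_categories(rows)}}
```

The lookup dictionaries only exist to find the right nested dict quickly; the result itself is made of ordinary lists and dicts.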

I'm not quite sure why you are building an outer dictionary with integer keys starting at 0. You could just make that a list too.
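
With a list, the dataset number is simply the position in the list. A short sketch of the same loop written that way, reusing cateList and n from the question:

```python
cateList = ["Main Dish", "Drink"]
n = 3  # number of datasets

# data[i] is the i-th dataset; no explicit integer keys are needed.
data = [
    {
        'data': i,
        'category': [{'name': name, 'subcategory': []} for name in cateList],
    }
    for i in range(n)
]
```

Each dataset gets its own fresh 'category' list, so appending to one dataset does not affect the others.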
