BringMyCakeBack
BringMyCakeBack

Reputation: 1557

Converting Arbitrarily Nested Dict to List of Lists

I am trying to convert a nested dict data structure into a flat list of lists, and I can't come up with a good solution. Here's the data I am starting with, and the output I want to generate. What I want to do is come up with a flatten function that works regardless of how deep the input data is nested.

input_dict_1 = {"data": [
    {"gender": "male",
     "data": [
         {"age": "adult",
          "data": {"average_height": 62, "average_weight": 200}},
         {"age": "youth",
          "data": {"average_height": 50, "average_weight": 120}}]},
    {"gender": "female",
     "data": [
         {"age": "adult",
          "data": {"average_height": 55, "average_weight": 130}},
         {"age": "youth",
          "data": {"average_height": 45, "average_weight": 80}},
         {"age": "infant",
          "data": {"average_height": 15, "average_weight": 35}}]}]}

output_array_1 = flatten(input_dict_1)

# output_array_1 = [["gender", "age", "average_height", "average_weight"],
#                   ["male", "adult", 62, 200],
#                   ["male", "youth", 50, 120],
#                   ["female", "adult", 55, 130],
#                   ["female", "youth", 45, 80],
#                   ["female", "infant", 15, 35]]

input_dict_2 = {"data": [
    {"animal": "bunny",
     "data": [
         {"color": "white",
          "data": [
              {"age": "adult",
               "data": {"speed": 30, "teeth": 24}},
              {"age": "youth",
               "data": {"speed": 20, "teeth": 24}}]}]},
    {"animal": "horse",
     "data": [
         {"color": "brown",
          "data": [
              {"age": "adult",
               "data": {"speed": 120, "teeth": 6}}]}]}]}

output_array_2 = flatten(input_dict_2)

# output_array_1 = [["animal", "color", "age", "speed", "teeth"],
#                   ["bunny", "white", "adult", 30, 24],
#                   ["bunny", "white", "youth", 20, 24],
#                   ["horse", "brown", "adult", 120, 6]]

It's not too hard if you know how many levels deep the structure goes ahead of time, but I'm stuck on how to write a single function that works for arbitrarily nested input data.

There are a couple of conditions to which the input data will always conform:

  1. The number of levels will always be the same within a single input data structure. For instance, in a longer version of the input_dict_1 example there would never be a third grouping level beyond gender and age.
  2. At each level, the next subgroup is the value for a key with the name data.

There's got to be an elegant, pythonic solution for this. Any ideas?

(The context here is I'm trying to convert JSON I receive from an API into a CSV file, but this is the only hard part. Also, I know there are already a bunch of questions on SO dealing with dicts to lists, but I was not able to find one that uses a similar input/output structure.)

Upvotes: 1

Views: 1249

Answers (1)

Gabriel
Gabriel

Reputation: 10884

This is the kind of question where recursion is your friend. As a first example we can write a function to grab the column names that we'll need from one of your inputs,

input_dict_1 = {"data": [
    {"gender": "male",
     "data": [
         {"age": "adult",
          "data": {"average_height": 62, "average_weight": 200}},
         {"age": "youth",
          "data": {"average_height": 50, "average_weight": 120}}]},
    {"gender": "female",
     "data": [
         {"age": "adult",
          "data": {"average_height": 55, "average_weight": 130}},
         {"age": "youth",
          "data": {"average_height": 45, "average_weight": 80}},
         {"age": "infant",
          "data": {"average_height": 15, "average_weight": 35}}]}]}

def get_col_names( d, l=None ):
    if l==None: l=[]
    if isinstance(d['data'], list):
        l.extend( k for k in d['data'][0].keys() if k != 'data' )
        get_col_names( d['data'][0], l )
    else:
        l.extend( d['data'].keys() )
    return l

The above example was just to get you thinking about recursion. We can get the column names and visit all the key value pairs we want in one function,

def walk_dict( d, e=None, l=None, cn=None ):
    if e==None: e={}
    if l==None: l=[]
    if cn==None: cn=[]
    for k,v in d.items():
        if k != 'data':
            if k not in cn: cn.append(k)
            e[k] = v
    if isinstance(d['data'], list):
        for dp in d['data']:
            walk_dict( dp, e, l, cn )
    else:
        for k,v in d['data'].items():
            if k not in cn: cn.append(k)
            e[k] = v
        l.append( e.copy() )
    return l, cn

In the above function, e represents the current row being gathered, l is a list of all rows parsed so far, and cn is a list of the column names. Finally some code to build the list of lists you want,

rows, col_names = walk_dict( input_dict_1 )
output_array_1 = [col_names]
for d in rows:
    output_array_1.append( [d[name] for name in col_names] )

So what's going on here? e keeps track of the current row keys and values. This means it will only ever have 4 key-value pairs in it (for this input). We want "gender":"male" to persist while we go through "age":"adult" and "age":"youth". This is a perfect fit for a dictionary. Initially the entries in each row will be unordered. Once col_names is built we can use it to pull out the entries in each row in the order we like.

Why do we use e.copy() instead of just e? This brushes up against an aspect of Python that can be very confusing if you're not aware of it. If you just append e to a list you will get a reference to the object e. If you change e, the reference in the list will point to the changed e. This would result in all rows being equal to the key:value pairs for the final row. BUT by doing e.copy() we are telling Python to make a new object and append it to the list. Then we can update e without affecting what is already in our list.

col_names will be ordered according to the nesting of the input dict except for the final dictionary keys ( i.e. average_height and average_weight ) as the order in which dictionary keys are accessed is arbitrary.

Upvotes: 2

Related Questions