Flatten a nested dict structure into a dataset

Question

For some post-processing, I need to flatten a structure like this:

{'foo': {
          'cat': {'name': 'Hodor',  'age': 7},
          'dog': {'name': 'Mordor', 'age': 5}},
 'bar': { 'rat': {'name': 'Izidor', 'age': 3}}
}

into this dataset:

[{'foobar': 'foo', 'animal': 'dog', 'name': 'Mordor', 'age': 5},
 {'foobar': 'foo', 'animal': 'cat', 'name': 'Hodor',  'age': 7},
 {'foobar': 'bar', 'animal': 'rat', 'name': 'Izidor', 'age': 3}]

So I wrote this function:

import copy

def flatten(data, primary_keys):
    out = []
    # Work on a reversed copy so that pop() yields the keys in order
    # without touching the caller's list.
    keys = copy.copy(primary_keys)
    keys.reverse()
    def visit(node, primary_values, prim):
        if len(prim):
            p = prim.pop()
            for key, child in node.items():
                primary_values[p] = key
                visit(child, primary_values, copy.copy(prim))
        else:
            # Leaf record: copy it before adding the primary-key columns.
            new = copy.copy(node)
            new.update(primary_values)
            out.append(new)
    visit(data, {}, keys)
    return out

out = flatten(a, ['foobar', 'animal'])  # a is the nested dict shown above

I was not really satisfied because I have to use copy.copy to protect my inputs. Obviously, when using flatten one does not want the inputs to be altered.

Then I thought about an alternative that keeps more state in variables shared across the recursion (local to flatten, not truly global) and passes an index to visit instead of the key list itself. However, this does not really help me get rid of the ugly initial copy:

    keys = copy.copy(primary_keys)
    keys.reverse()

So here is my final version:

def flatten(data, keys):
    # Shallow copy: only the top-level dict is duplicated. The nested
    # dicts are still shared with the caller, so node.update() below
    # modifies them.
    data = copy.copy(data)
    keys = copy.copy(keys)
    keys.reverse()
    out = []
    values = {}
    def visit(node, id):
        if id:
            id -= 1
            for key, child in node.items():
                values[keys[id]] = key
                visit(child, id)
        else:
            node.update(values)
            out.append(node)
    visit(data, len(keys))
    return out

Is there a better implementation (that can avoid the use of copy.copy)?

IanS · Accepted Answer

Edit: modified to account for variable dictionary depth.

By using the merge function from my previous answer (below), you can avoid calling update, which modifies the caller's dictionaries in place. There is then no need to copy the dictionary first.

def flatten(data, keys):
    out = []
    values = {}
    def visit(node, depth):
        if depth < len(keys):
            # keys[depth] names the current nesting level, so keys is
            # read in the order given and never modified.
            for key, child in node.items():
                values[keys[depth]] = key
                visit(child, depth + 1)
        else:
            out.append(merge(node, values))  # use merge instead of update
    visit(data, 0)
    return out
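
A minimal, self-contained sketch of the same idea, to check that the caller's data really is left alone. The name flatten_no_copy, the depth counter and the sample variable names are illustrative assumptions rather than code from this thread, and the output order shown by print relies on dicts preserving insertion order (Python 3.7+).

```python
import copy


def merge(d1, d2):
    # Build a new dict; neither argument is modified.
    merged = dict(d1)
    merged.update(d2)
    return merged


def flatten_no_copy(data, keys):
    """Hypothetical variant: descend len(keys) levels, labelling each level."""
    out = []

    def visit(node, depth, labels):
        if depth == len(keys):
            # Leaf record: combine it with the labels collected on the way down.
            out.append(merge(node, labels))
            return
        for key, child in node.items():
            visit(child, depth + 1, merge(labels, {keys[depth]: key}))

    visit(data, 0, {})
    return out


a = {
    'foo': {'cat': {'name': 'Hodor', 'age': 7},
            'dog': {'name': 'Mordor', 'age': 5}},
    'bar': {'rat': {'name': 'Izidor', 'age': 3}},
}
snapshot = copy.deepcopy(a)  # used only to verify that `a` is not altered
rows = flatten_no_copy(a, ['foobar', 'animal'])

print(rows)
print(a == snapshot)  # True: the nested input is unchanged
```

Passing the labels down as a fresh dictionary at each level, instead of sharing one values dict, is a design choice of this sketch: it trades a few extra small dicts for not having any mutable state shared between branches.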

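One thing to note about protecting the keys input: the only call that modifies it is keys.reverse(), which reverses the list in place. Without that call (as in the version above), nothing touches keys, so no copy is needed.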
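
To make the point about keys concrete, here is a small sketch of which list operations modify their input. The variable names are illustrative only.

```python
keys = ['foobar', 'animal']

# list.reverse() reverses in place and returns None, so a list passed in
# by the caller would be changed. The copy here only protects this demo.
in_place = list(keys)
result = in_place.reverse()

# reversed() and slicing both build a new sequence and leave `keys` alone.
via_reversed = list(reversed(keys))
via_slice = keys[::-1]

print(result)        # None
print(in_place)      # ['animal', 'foobar']
print(via_reversed)  # ['animal', 'foobar']
print(keys)          # ['foobar', 'animal']
```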

Previous answer

How about a list comprehension?

def merge(d1, d2):
    return dict(list(d1.items()) + list(d2.items()))

# A single comprehension with two for clauses yields one flat list of rows.
[merge({'foobar': key, 'animal': sub_key}, sub_sub_dict)
    for key, sub_dict in a.items()
    for sub_key, sub_sub_dict in sub_dict.items()]

The tricky part was merging the dictionaries without using update, which modifies a dictionary in place and returns None, so it cannot be used as an expression inside a comprehension.
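
As a sketch of why update does not fit here, and of two other expressions that build a merged dictionary, assuming string keys for the first and Python 3.5 or later for the second:

```python
d1 = {'foobar': 'foo', 'animal': 'cat'}
d2 = {'name': 'Hodor', 'age': 7}

# dict.update mutates its receiver and returns None, so its return value
# is useless as the element expression of a comprehension.
scratch = dict(d1)
returned = scratch.update(d2)

# Two expressions that return a new merged dict and leave d1 and d2 alone:
merged_kwargs = dict(d1, **d2)   # requires the keys of d2 to be strings
merged_unpack = {**d1, **d2}     # dict unpacking, Python 3.5+

print(returned)                        # None
print(merged_kwargs == merged_unpack)  # True
```

In both forms, values from d2 win when a key appears in both dictionaries, which matches the behaviour of the merge function above.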
