Reputation: 71

Flatten nested JSON (Dict, List) into List to prepare to write into DB

I am still working on a problem to flatten a nested JSON file. The nested items are either List or Dict:

Here is the file I want to flatten (Unlike in my previous post, I kept it at good length, but it only contains input[0] not any subsequent items as it will be very long):

input = [{'states': ['USED'], 'niceName': '1-series', 'id': 'BMW_1_Series',
            'years': [{'styles':
                       [{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
                         'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 100994560},
                        {'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
                          'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 100974974},
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 100974975},
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 100994561}
                        ],
                       'states': ['USED'], 'id': 100524709, 'year': 2008},
                      {'styles':
                       [{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101082656}, 
                        {'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101082655},
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101082663},
                        {'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101082662}
                        ], 
                       'states': ['USED'], 'id': 100503222, 'year': 2009},
                      {'styles': 
                       [{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101200599},
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101200600}, 
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101200607}, 
                        {'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101200601}
                        ], 
                       'states': ['USED'], 'id': 100529091, 'year': 2010}, 
                      {'styles':
                       [{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101288165}, 
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101288166}, 
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101288298}, 
                        {'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101288297}
                        ], 
                       'states': ['USED'], 'id': 100531309, 'year': 2011}, 
                      {'styles': 
                       [{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101381667}, 
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101381668}, 
                        {'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101381665}, 
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101381666}
                        ], 
                       'states': ['USED'], 'id': 100534729, 'year': 2012}, 
                      {'styles': 
                       [{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                        'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 200428722},
                        {'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 200428721}, 
                        {'trim': '135is', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '135is 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 200421701}, 
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 200428724}, 
                        {'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 200428723}, 
                        {'trim': '128i SULEV', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'}, 
                         'name': '128i SULEV 2dr Coupe (3.0L 6cyl 6M)', 'id': 200428726}, 
                        {'trim': '128i SULEV', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '128i SULEV 2dr Convertible (3.0L 6cyl 6M)', 'id': 200428725}, 
                        {'trim': '135is', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'}, 
                         'name': '135is 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 200428727}
                        ], 
                       'states': ['USED'], 'id': 200421700, 'year': 2013}
                      ], 
          'name': '1 Series', 'make': {'niceName': 'bmw', 'name': 'BMW', 'id': 200000081}
          }, #here is more to come, but I needed to crop it
          ]

The code I used so far after failing with my aproach was written by @poke from: Flattening Generic JSON List of Dicts or Lists in Python

def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }

    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None

def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)

        # just return fully flat objects
        if key is None:
            yield flat
            continue

        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub

I receive the following error:

AttributeError: 'str' object has no attribute 'items'

Which results from 'states': ['USED']

I do not know how to handle that. The key 'states' can be kept as a list.

I hope that somebody can help me out on that.

Ps: This is a follow up post from Python: Write Nested JSON as multiple elements in List

Upvotes: 1

Answers (3)

Bert Maier

Reputation: 71

Here is my solution for splitObj

def splitObj (obj, prefix = None):
'''
Split the object, returning a 3-tuple with the flat object, optionally
followed by the key for the subobjects and a list of those subobjects.
obj needs to be a Dictonary
'''
# copy the object, optionally add the prefix before each key
new = obj.copy() if prefix is None or prefix=="NotFlat" else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }

cL = 0
cD = 0
# try to find the key holding the subobject or a list of subobjects
for k, v in new.items():
    #Determine the number of lists in v
    if isinstance(v, list):
        cL += 1
    #Determine the number of dict in v
    elif isinstance(v, dict):
        cD += 1     
for k, v in new.items():
    # list of subobjects
    if isinstance(v, list):
        if (cD+cL) <=1:
            try:
                type(v[0])
            except IndexError:
                v = [""]
            if not isinstance(v[0], str):
                del new[k]
                return new, k, v
            elif isinstance(v[0], str):
                #handle list when only containing strings, return, the whole thing
                #solve other dicts which might be in the line
                #use "NotFlat" to run loop again but without adding a prefix

                new[k] = ", ".join(v)
                return new, None, None
            else:
                custLog.logger.info("")
        elif (cD+cL) >1:

            #print("Count List2 CD: "+str(cD))
            #print("Count LIST2 CL: "+str(cL))

            #if list is empty
            try:
                type(v[0])
            except IndexError:
                v = [""]

            if not isinstance(v[0], str):
                del new[k]
                for x in flatten([new]):
                    newOut = x
                    break
                return newOut, k, v
            elif isinstance(v[0], str):
                #handle list when only containing strings, return, the whole thing
                #solve other dicts which might be in the line
                #use "NotFlat" to run loop again but without adding a prefix
                new[k] = ", ".join(v)
                return None, "NotFlat", [new]
            else:
                custLog.logger.error("weder noch 2")

    # or just one subobject
    elif isinstance(v, dict):
        if (cD+cL) <=1:
            del new[k]
            return new, k, [v]
        elif (cD+cL) >1:
            del new[k]
            for x in flatten([new]):
                newOut = x
                break
            return newOut, k, [v]
return new, None, None

and here for flatten

def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items


    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)
        if subs is None:
            if key is None:
                yield flat
                continue    
        # just return fully flat objects
        if key is None and flat is not None:
            yield flat
            continue

        # otherwise recursively flatten the subobjects
        try:
            for sub in flatten(subs, key):
                if flat is not None:
                    sub.update(flat)
                yield sub
        except TypeError as e:
            custLog.logger.error("ERR: TypeError"+str(e))

Upvotes: 1

Hinni

Reputation: 166

Rethinking the problem

It is often easier to find a solution for a more general problem. So, let's look closer at the problem first.

The input is some JSON file describing a set of objects.

An object is recusively defined as either an atom (strings or numbers) or a dict with object values. Lists are used to represent alternatives (i.e. any element of the list can take the place of the list). For example {a:[1,2]} means that a can be either 1 or 2.

The output should be a list of objects that do not contain any options. Additionally, the objects should be flattened, i.e. should be be dicts whose values are atoms and whose keys are describing the path to the value in the original object.

My solution handles alternatives and flattening separately.

Normalising

The function normalise below takes the input of json.dumps and yields a sequence of dicts. Note that the input and output of normalise have the same semantics and describe the same set of objects. The output is just normalised in the sense that it does contain alternatives only at the top level. Database people would call that denormalised because it is undesirable for relational databases.

normalise always returns a sequence of objects. normalise is implemented as generator to keep memory usage low.

The following cases are distinguished in normalise.

An atom input means there is only one possibility. So, the atom is yielded (this is like returning a list containing the atom).
A list means alternatives of alternatives. It yields all elements of its normalised input (this is like concatenating lists).
A dict means that we have to consider all combinations of alternatives for the individual keys. So, we return a cartesian product of the alternatives.

Here is the code:

import itertools

def normalise(x):
    if isinstance(x, dict):
        keys = x.keys()
        values = (normalise(i) for i in x.values())
        for i in itertools.product(*values):
            yield (dict(zip(keys, i)))
    elif isinstance(x, list):
        #if not x:           # uncomment for "LEFT JOIN" behaviour
        #    yield None
        for i in x:
            yield from normalise(i)
    else:
        yield x

This code does not return an object if it contains any empty lists. This is because there is no possible value. This is like SQL "INNER JOIN". From Bert's answer it looks like he wants "LEFT JOIN" behaviour (i.e. some default value). To achieve that just uncomment the two lines.

Pseudo-flattening

The objects yielded by normalise still have the original (nested) dict structure. They could be flattened using code found in other discussions.

However, the OP wants to insert the objects in a database. So, he will most likely not need a list of keys of the flattened dictionary. He will only need a function returning the value for a given path.

This can be achieved by creating a wrapper object to the dict that has a __getitem__ method. This wrapper can also be used to return a default value for non-existing paths.

class DictWrapper:
    def __init__(self, d, sep='.'):
        self.d = d
        self.sep = sep

    def __getitem__(self, key):
        ret = self.d
        try:
            for k in key.split(self.sep):
                ret = ret[k]
            return ret
        except KeyError:
            return None

The sql insert can than look like the following (tested with psycopg2)

for i in normalise(input):
    cur.execute('insert into mytable (year) VALUES (%(years.year)s)', DictWrapper(i))

Implementation details

This implemenation clearly sacrifices some runtime performance for clarity.
Abstract base classes could be used instead of list and dict. However, this might be problematic because str is a sequence but should be treated as atom.
DictWrapper only works properly if sep is not contained in any of the dict keys.
normalise does not filter out duplicates. This could be done by using sets and named tuples instead of lists and dicts. However, this would imply that the whole result has to be in memory. It might be better to filter out duplicates at the database level.
To keep memory usage at a minimum the JSON should be read lazily.

Upvotes: 1

Parfait

Reputation: 107652

While not a generalizable function, consider walking through each nested element for a flat output for database import or flatfile (csv, txt) export. Since json files consist of combinations of dictionaries and lists, handle them accordingly at each level:

items = []
for outer in data:    
    inner = [''] * 15    
    for outerk, outerv in outer.items():        
        inner[0] = outer['states'][0]
        inner[1] = outer['niceName']
        inner[2] = outer['id']
        inner[3] = outer['make']['niceName']
        inner[4] = outer['make']['name']
        inner[5] = outer['make']['id']    
        if outerk == 'years':            
            for yri in outer[outerk]:                
                for yrk, yrv in yri.items():
                    inner[6] = yri['states'][0] 
                    inner[7] = yri['id'] 
                    inner[8] = yri['year'] 
                    if yrk == 'styles':
                        for stylei in yri[yrk]:
                            inner[9] = stylei['trim']
                            inner[10] = stylei['name']
                            inner[11] = stylei['id']
                            inner[12] = stylei['submodel']['body']
                            inner[13] = stylei['submodel']['niceName']
                            inner[14] = stylei['submodel']['modelName']

                            items.append(inner[0:14])

for i in items:        
    print(i)

Output (where parent items repeat for each child)

# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Convertible (3.0L 6cyl 6M)', 100994560, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Coupe (3.0L 6cyl 6M)', 100974974, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 100974975, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 100994561, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100503222, 2009, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 101082656, 'Coupe', 'coupe']

Upvotes: 0

Flatten nested JSON (Dict, List) into List to prepare to write into DB

Answers (3)

Rethinking the problem

Normalising

Pseudo-flattening

Implementation details

Related Questions