Reputation: 71
I am still working on a problem to flatten a nested JSON file. The nested items are either List or Dict:
Here is the file I want to flatten (Unlike in my previous post, I kept it at good length, but it only contains input[0] not any subsequent items as it will be very long):
input = [{'states': ['USED'], 'niceName': '1-series', 'id': 'BMW_1_Series',
'years': [{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 100994560},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 100974974},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 100974975},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 100994561}
],
'states': ['USED'], 'id': 100524709, 'year': 2008},
{'styles':
[{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101082656},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101082655},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101082663},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101082662}
],
'states': ['USED'], 'id': 100503222, 'year': 2009},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101200599},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101200600},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101200607},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101200601}
],
'states': ['USED'], 'id': 100529091, 'year': 2010},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101288165},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101288166},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101288298},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101288297}
],
'states': ['USED'], 'id': 100531309, 'year': 2011},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101381667},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101381668},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101381665},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101381666}
],
'states': ['USED'], 'id': 100534729, 'year': 2012},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 200428722},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 200428721},
{'trim': '135is', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135is 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 200421701},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 200428724},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 200428723},
{'trim': '128i SULEV', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i SULEV 2dr Coupe (3.0L 6cyl 6M)', 'id': 200428726},
{'trim': '128i SULEV', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i SULEV 2dr Convertible (3.0L 6cyl 6M)', 'id': 200428725},
{'trim': '135is', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135is 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 200428727}
],
'states': ['USED'], 'id': 200421700, 'year': 2013}
],
'name': '1 Series', 'make': {'niceName': 'bmw', 'name': 'BMW', 'id': 200000081}
}, #here is more to come, but I needed to crop it
]
The code I used so far after failing with my aproach was written by @poke from: Flattening Generic JSON List of Dicts or Lists in Python
def splitObj (obj, prefix = None):
'''
Split the object, returning a 3-tuple with the flat object, optionally
followed by the key for the subobjects and a list of those subobjects.
'''
# copy the object, optionally add the prefix before each key
new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }
# try to find the key holding the subobject or a list of subobjects
for k, v in new.items():
# list of subobjects
if isinstance(v, list):
del new[k]
return new, k, v
# or just one subobject
elif isinstance(v, dict):
del new[k]
return new, k, [v]
return new, None, None
def flatten (data, prefix = None):
'''
Flatten the data, optionally with each key prefixed.
'''
# iterate all items
for item in data:
# split the object
flat, key, subs = splitObj(item, prefix)
# just return fully flat objects
if key is None:
yield flat
continue
# otherwise recursively flatten the subobjects
for sub in flatten(subs, key):
sub.update(flat)
yield sub
I receive the following error:
AttributeError: 'str' object has no attribute 'items'
Which results from 'states': ['USED']
I do not know how to handle that. The key 'states' can be kept as a list.
I hope that somebody can help me out on that.
Ps: This is a follow up post from Python: Write Nested JSON as multiple elements in List
Upvotes: 1
Views: 3974
Reputation: 71
Here is my solution for splitObj
def splitObj (obj, prefix = None):
'''
Split the object, returning a 3-tuple with the flat object, optionally
followed by the key for the subobjects and a list of those subobjects.
obj needs to be a Dictonary
'''
# copy the object, optionally add the prefix before each key
new = obj.copy() if prefix is None or prefix=="NotFlat" else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }
cL = 0
cD = 0
# try to find the key holding the subobject or a list of subobjects
for k, v in new.items():
#Determine the number of lists in v
if isinstance(v, list):
cL += 1
#Determine the number of dict in v
elif isinstance(v, dict):
cD += 1
for k, v in new.items():
# list of subobjects
if isinstance(v, list):
if (cD+cL) <=1:
try:
type(v[0])
except IndexError:
v = [""]
if not isinstance(v[0], str):
del new[k]
return new, k, v
elif isinstance(v[0], str):
#handle list when only containing strings, return, the whole thing
#solve other dicts which might be in the line
#use "NotFlat" to run loop again but without adding a prefix
new[k] = ", ".join(v)
return new, None, None
else:
custLog.logger.info("")
elif (cD+cL) >1:
#print("Count List2 CD: "+str(cD))
#print("Count LIST2 CL: "+str(cL))
#if list is empty
try:
type(v[0])
except IndexError:
v = [""]
if not isinstance(v[0], str):
del new[k]
for x in flatten([new]):
newOut = x
break
return newOut, k, v
elif isinstance(v[0], str):
#handle list when only containing strings, return, the whole thing
#solve other dicts which might be in the line
#use "NotFlat" to run loop again but without adding a prefix
new[k] = ", ".join(v)
return None, "NotFlat", [new]
else:
custLog.logger.error("weder noch 2")
# or just one subobject
elif isinstance(v, dict):
if (cD+cL) <=1:
del new[k]
return new, k, [v]
elif (cD+cL) >1:
del new[k]
for x in flatten([new]):
newOut = x
break
return newOut, k, [v]
return new, None, None
and here for flatten
def flatten (data, prefix = None):
'''
Flatten the data, optionally with each key prefixed.
'''
# iterate all items
for item in data:
# split the object
flat, key, subs = splitObj(item, prefix)
if subs is None:
if key is None:
yield flat
continue
# just return fully flat objects
if key is None and flat is not None:
yield flat
continue
# otherwise recursively flatten the subobjects
try:
for sub in flatten(subs, key):
if flat is not None:
sub.update(flat)
yield sub
except TypeError as e:
custLog.logger.error("ERR: TypeError"+str(e))
Upvotes: 1
Reputation: 166
It is often easier to find a solution for a more general problem. So, let's look closer at the problem first.
The input is some JSON file describing a set of objects.
An object is recusively defined as either an atom (strings or numbers) or a dict with object values. Lists are used to represent alternatives (i.e. any element of the list can take the place of the list).
For example {a:[1,2]}
means that a
can be either 1
or 2
.
The output should be a list of objects that do not contain any options. Additionally, the objects should be flattened, i.e. should be be dicts whose values are atoms and whose keys are describing the path to the value in the original object.
My solution handles alternatives and flattening separately.
The function normalise
below takes the input of json.dumps
and yields a sequence of dicts. Note that the input and output of normalise
have the same semantics and describe the same set of objects. The output is just normalised in the sense that it does contain alternatives only at the top level. Database people would call that denormalised because it is undesirable for relational databases.
normalise
always returns a sequence of objects. normalise
is implemented as generator to keep memory usage low.
The following cases are distinguished in normalise
.
Here is the code:
import itertools
def normalise(x):
if isinstance(x, dict):
keys = x.keys()
values = (normalise(i) for i in x.values())
for i in itertools.product(*values):
yield (dict(zip(keys, i)))
elif isinstance(x, list):
#if not x: # uncomment for "LEFT JOIN" behaviour
# yield None
for i in x:
yield from normalise(i)
else:
yield x
This code does not return an object if it contains any empty lists. This is because there is no possible value. This is like SQL "INNER JOIN". From Bert's answer it looks like he wants "LEFT JOIN" behaviour (i.e. some default value). To achieve that just uncomment the two lines.
The objects yielded by normalise
still have the original (nested) dict structure. They could be flattened using code found in other discussions.
However, the OP wants to insert the objects in a database. So, he will most likely not need a list of keys of the flattened dictionary. He will only need a function returning the value for a given path.
This can be achieved by creating a wrapper object to the dict that has a __getitem__
method. This wrapper can also be used to return a default value for non-existing paths.
class DictWrapper:
def __init__(self, d, sep='.'):
self.d = d
self.sep = sep
def __getitem__(self, key):
ret = self.d
try:
for k in key.split(self.sep):
ret = ret[k]
return ret
except KeyError:
return None
The sql insert can than look like the following (tested with psycopg2)
for i in normalise(input):
cur.execute('insert into mytable (year) VALUES (%(years.year)s)', DictWrapper(i))
This implemenation clearly sacrifices some runtime performance for clarity.
Abstract base classes could be used instead of list
and dict
. However, this might be problematic because str
is a sequence but should be treated as atom.
DictWrapper
only works properly if sep
is not contained in any of the dict keys.
normalise
does not filter out duplicates. This could be done by using sets and named tuples instead of lists and dicts. However, this would imply that the whole result has to be in memory. It might be better to filter out duplicates at the database level.
To keep memory usage at a minimum the JSON should be read lazily.
Upvotes: 1
Reputation: 107652
While not a generalizable function, consider walking through each nested element for a flat output for database import or flatfile (csv, txt) export. Since json files consist of combinations of dictionaries and lists, handle them accordingly at each level:
items = []
for outer in data:
inner = [''] * 15
for outerk, outerv in outer.items():
inner[0] = outer['states'][0]
inner[1] = outer['niceName']
inner[2] = outer['id']
inner[3] = outer['make']['niceName']
inner[4] = outer['make']['name']
inner[5] = outer['make']['id']
if outerk == 'years':
for yri in outer[outerk]:
for yrk, yrv in yri.items():
inner[6] = yri['states'][0]
inner[7] = yri['id']
inner[8] = yri['year']
if yrk == 'styles':
for stylei in yri[yrk]:
inner[9] = stylei['trim']
inner[10] = stylei['name']
inner[11] = stylei['id']
inner[12] = stylei['submodel']['body']
inner[13] = stylei['submodel']['niceName']
inner[14] = stylei['submodel']['modelName']
items.append(inner[0:14])
for i in items:
print(i)
Output (where parent items repeat for each child)
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Convertible (3.0L 6cyl 6M)', 100994560, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Coupe (3.0L 6cyl 6M)', 100974974, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 100974975, 'Coupe', 'coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 100994561, 'Convertible', 'convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100503222, 2009, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 101082656, 'Coupe', 'coupe']
Upvotes: 0