Reputation: 2495
I have a JSON which I converted into a dictionary and trying to make a dataframe out of it. the problem is that it is multiple nested and with inconsistent data
For e.g.
d = """[
{
"id": 51,
"kits": [
{
"id": 57,
"kit": "KIT1182A",
"items": [
{
"id": 254,
"product": {
"name": "Plastic Pallet",
"short_code": "PP001",
"priceperunit": 2500,
"volumetric_weight": 21.34
},
"quantity": 5
},
{
"id": 258,
"product": {
"name": "Separator Sheet",
"short_code": "FSS001",
"priceperunit": 170,
"volumetric_weight": 0.9
},
"quantity": 18
}
],
"quantity": 5
}, #end of kit
{
"id": 58,
"kit": "KIT1182B",
"items": [
{
"id": 259,
"product": {
"name": "Plastic Pallet",
"short_code": "PP001",
"priceperunit": 2500,
"volumetric_weight": 21.34
},
"quantity": 5
},
{
"id": 260,
"product": {
"name": "Plastic Sidewall",
"short_code": "PS001",
"priceperunit": 1250,
"volumetric_weight": 16.1
},
"quantity": 5
},
{
"id": 261,
"product": {
"name": "Plastic Lid",
"short_code": "PL001",
"priceperunit": 1250,
"volumetric_weight": 9.7
},
"quantity": 5
}
],
"quantity": 7
} #end of kit
],
"warehouse": "Yantraksh Logistics Private limited_GGNPC1",
"receiver_client": "Lumax Cornaglia Auto Tech Private Limited",
"transport_by": "Kiran Roadways",
"transaction_type": "Return",
"transaction_date": "2020-08-13T04:34:11.678000Z",
"transaction_no": 1180,
"is_delivered": false,
"driver_name": "__________",
"driver_number": "__________",
"lr_number": 0,
"vehicle_number": "__________",
"freight_charges": 0,
"vehicle_type": "Part Load",
"remarks": "0",
"flow": 36,
"owner": 2
} ]"""
I want to convert it into a dataframe like the following:
transaction_no is_delivered flow transaction_date receiver_client warehouse kits quantity product1 quantity1 product2 quantity2 product3 quantity3
1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 KIT1182A 5 PP001 5 FSS001 18 NaN NaN
1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 KIT1182B 7 PP001 5 PS001 5 PL001 7.0
or to show it in a better way:
What I have done:
data = json.loads(d)
result_dataframe = pd.DataFrame(data)
l = ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse','kits'] #fields that I need
result_dataframe = result_dataframe[l]
result_dataframe.to_csv("out.csv")
I tried :
def flatten(input_dict, separator='_', prefix=''):
output_dict = {}
for key, value in input_dict.items():
if isinstance(value, dict) and value:
deeper = flatten(value, separator, prefix+key+separator)
output_dict.update({key2: val2 for key2, val2 in deeper.items()})
elif isinstance(value, list) and value:
for index, sublist in enumerate(value, start=1):
if isinstance(sublist, dict) and sublist:
deeper = flatten(sublist, separator, prefix+key+separator+str(index)+separator)
output_dict.update({key2: val2 for key2, val2 in deeper.items()})
else:
output_dict[prefix+key+separator+str(index)] = value
else:
output_dict[prefix+key] = value
return output_dict
But it gives all the values in a single row, how can I seprate them on the basis of kits and get the result?
Upvotes: 1
Views: 294
Reputation: 13646
Data transformations like above are very common. Pandas provide lots of tools to help you in this task.
data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse']) # line 1
df = df.explode('items') # line 2
df[['product_code', 'product_quantity']] = df['items'].apply(lambda x: pd.Series([x['product']['short_code'], x['quantity']])) # line 3
df.drop(columns=['items']) # line 4
will provide you with
id kit quantity transaction_no is_delivered flow transaction_date receiver_client warehouse product_code product_quantity
0 57 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5
0 57 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 FSS001 18
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PS001 5
1 58 KIT1182B 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PL001 5
Actually, the trick is only in the pd.json_normalize
(line 1). It will create a dataframe very similar to what you've asked for:
id kit items quantity transaction_no is_delivered flow transaction_date receiver_client warehouse
0 57 KIT1182A [{'id': 254, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 258, 'product': {'name': 'Separator Sheet', 'short_code': 'FSS001', 'priceperunit': 170, 'volumetric_weight': 0.9}, 'quantity': 18}] 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1
1 58 KIT1182B [{'id': 259, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 260, 'product': {'name': 'Plastic Sidewall', 'short_code': 'PS001', 'priceperunit': 1250, 'volumetric_weight': 16.1}, 'quantity': 5}, {'id': 261, 'product': {'name': 'Plastic Lid', 'short_code': 'PL001', 'priceperunit': 1250, 'volumetric_weight': 9.7}, 'quantity': 5}] 7 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1
The column items
has all the data on the products in a dictionary. It is possible to expand it in a similar manner as in line 3, but I suggest strongly against it. I will explain why later. So, line 2 explodes each line according to amount of items in the kit. Third line extracts prod_name
and prod_quantity
, and finally at the last line you get rid of the original data.
So why you should not have a table with variable amount of columns? You will never know how many items you have in each kit. You will to fiddle to get values for those variable columns. It is even worse than having information inside the dictionary.
To get the result in exact way you've asked, just run following:
data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse'] )
tmp = df['items'].apply(lambda it: [{'product'+str(indx+1):x['product']['short_code'], 'quantity'+str(indx+1):x['quantity']} for indx,x in enumerate(it)])
tmp = tmp.apply(lambda x : {k:el[k] for el in x for k in el})
tmp = pd.DataFrame.from_records(tmp)
df = pd.concat([df, tmp], axis=1)
df = df.drop(columns=['items', 'id'])
On the data you put online the result is:
kit quantity transaction_no is_delivered flow transaction_date receiver_client warehouse product1 quantity1 product2 quantity2 product3 quantity3 product4 quantity4 product5 quantity5
0 KIT1182A 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5 PS002 5 PL001 5 FIN1182A 30 FSS001 18
1 KIT1182B 5 1180 False 36 2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited Yantraksh Logistics Private limited_GGNPC1 PP001 5 PS001 5 PL001 5 FIN1182B 20 FSS001 25
2 KIT1151 14 1179 False 1 2020-08-11T04:31:31.245000Z Mahindra & Mahindra_Kandivali Yantraksh Logistics Private limited_GGNPC1 PP001 14 PS001 14 PL001 14 FIN1151A 28 FSS001 42
3 KIT1151 15 1178 False 32 2020-08-10T04:30:12.022000Z Mahindra Vehicle Manufacturers Pune Yantraksh Logistics Private limited_GGNPC1 PP001 15 PS001 15 PL001 15 FIN1151A 29 FSS001 43
Upvotes: 7
Reputation: 7604
Using json_normalize
you can get this:
data = json.loads(d)
df = pd.json_normalize(data,
record_path=['kits', 'items'],
meta=[
['kits', 'kit'],
['id'],
['kits', 'quantity'],
['warehouse'],
['receiver_client']
],
meta_prefix='top')
print(df)
id quantity product.name ... topkits.quantity topwarehouse topreceiver_client
0 254 5 Plastic Pallet ... 5 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
1 258 18 Separator Sheet ... 5 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
2 259 5 Plastic Pallet ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
3 260 5 Plastic Sidewall ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
4 261 5 Plastic Lid ... 7 Yantraksh Logistics Private limited_GGNPC1 Lumax Cornaglia Auto Tech Private Limited
Upvotes: 1