Flatten a nested dictionary and convert it into columns

Question

I have a JSON which I converted into a dictionary and trying to make a dataframe out of it. the problem is that it is multiple nested and with inconsistent data

For e.g.

d = """[
      {
        "id": 51,
        "kits": [
            {
                "id": 57,
                "kit": "KIT1182A",
                "items": [
                    {
                        "id": 254,
                        "product": {
                            "name": "Plastic Pallet",
                            "short_code": "PP001",
                            "priceperunit": 2500,
                            "volumetric_weight": 21.34
                        },
                        "quantity": 5
                    },
                    {
                        "id": 258,
                        "product": {
                            "name": "Separator Sheet",
                            "short_code": "FSS001",
                            "priceperunit": 170,
                            "volumetric_weight": 0.9
                        },
                        "quantity": 18
                    }
                ],
                "quantity": 5
            },                                     #end of kit
            {
                "id": 58,
                "kit": "KIT1182B",
                "items": [
                    {
                        "id": 259,
                        "product": {
                            "name": "Plastic Pallet",
                            "short_code": "PP001",
                            "priceperunit": 2500,
                            "volumetric_weight": 21.34
                        },
                        "quantity": 5
                    },
                    {
                        "id": 260,
                        "product": {
                            "name": "Plastic Sidewall",
                            "short_code": "PS001",
                            "priceperunit": 1250,
                            "volumetric_weight": 16.1
                        },
                        "quantity": 5
                    },
                    {
                        "id": 261,
                        "product": {
                            "name": "Plastic Lid",
                            "short_code": "PL001",
                            "priceperunit": 1250,
                            "volumetric_weight": 9.7
                        },
                        "quantity": 5
                    }
                   
                ],
                "quantity": 7
            }                                    #end of kit
        ],
        "warehouse": "Yantraksh Logistics Private limited_GGNPC1",
        "receiver_client": "Lumax Cornaglia Auto Tech Private Limited",
        "transport_by": "Kiran Roadways",
        "transaction_type": "Return",
        "transaction_date": "2020-08-13T04:34:11.678000Z",
        "transaction_no": 1180,
        "is_delivered": false,
        "driver_name": "__________",
        "driver_number": "__________",
        "lr_number": 0,
        "vehicle_number": "__________",
        "freight_charges": 0,
        "vehicle_type": "Part Load",
        "remarks": "0",
        "flow": 36,
        "owner": 2
    } ]"""

I want to convert it into a dataframe like the following:

transaction_no  is_delivered    flow    transaction_date    receiver_client warehouse   kits    quantity    product1    quantity1   product2    quantity2   product3    quantity3
1180       False    36  2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited   Yantraksh Logistics Private limited_GGNPC1  KIT1182A    5   PP001   5   FSS001  18  NaN NaN
1180       False    36  2020-08-13T04:34:11.678000Z Lumax Cornaglia Auto Tech Private Limited   Yantraksh Logistics Private limited_GGNPC1  KIT1182B    7   PP001   5   PS001   5   PL001   7.0

or to show it in a better way:

What I have done:

data = json.loads(d)
result_dataframe = pd.DataFrame(data)
l = ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse','kits']  #fields that I need
result_dataframe = result_dataframe[l]
result_dataframe.to_csv("out.csv")

I tried :

def flatten(input_dict, separator='_', prefix=''):
output_dict = {}
for key, value in input_dict.items():
    if isinstance(value, dict) and value:
        deeper = flatten(value, separator, prefix+key+separator)
        output_dict.update({key2: val2 for key2, val2 in deeper.items()})
    elif isinstance(value, list) and value:
        for index, sublist in enumerate(value, start=1):
            if isinstance(sublist, dict) and sublist:
                deeper = flatten(sublist, separator, prefix+key+separator+str(index)+separator)
                output_dict.update({key2: val2 for key2, val2 in deeper.items()})
            else:
                output_dict[prefix+key+separator+str(index)] = value
    else:
        output_dict[prefix+key] = value
return output_dict

But it gives all the values in a single row, how can I seprate them on the basis of kits and get the result?

igrinis · Accepted Answer

Data transformations like above are very common. Pandas provide lots of tools to help you in this task.

data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse'])  # line 1
df = df.explode('items')  # line 2
df[['product_code', 'product_quantity']] = df['items'].apply(lambda x: pd.Series([x['product']['short_code'], x['quantity']]))  # line 3
df.drop(columns=['items'])  # line 4

will provide you with

  id       kit  quantity transaction_no is_delivered flow             transaction_date                            receiver_client                                   warehouse product_code  product_quantity
0  57  KIT1182A  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001        5               
0  57  KIT1182A  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  FSS001       18              
1  58  KIT1182B  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001        5               
1  58  KIT1182B  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PS001        5               
1  58  KIT1182B  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PL001        5

Actually, the trick is only in the pd.json_normalize (line 1). It will create a dataframe very similar to what you've asked for:

   id       kit                                                                                                                                                                                                                                                                                                                                                                                                                             items  quantity transaction_no is_delivered flow             transaction_date                            receiver_client                                   warehouse
0  57  KIT1182A  [{'id': 254, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 258, 'product': {'name': 'Separator Sheet', 'short_code': 'FSS001', 'priceperunit': 170, 'volumetric_weight': 0.9}, 'quantity': 18}]                                                                                                                                          5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1
1  58  KIT1182B  [{'id': 259, 'product': {'name': 'Plastic Pallet', 'short_code': 'PP001', 'priceperunit': 2500, 'volumetric_weight': 21.34}, 'quantity': 5}, {'id': 260, 'product': {'name': 'Plastic Sidewall', 'short_code': 'PS001', 'priceperunit': 1250, 'volumetric_weight': 16.1}, 'quantity': 5}, {'id': 261, 'product': {'name': 'Plastic Lid', 'short_code': 'PL001', 'priceperunit': 1250, 'volumetric_weight': 9.7}, 'quantity': 5}]  7         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1

The column items has all the data on the products in a dictionary. It is possible to expand it in a similar manner as in line 3, but I suggest strongly against it. I will explain why later. So, line 2 explodes each line according to amount of items in the kit. Third line extracts prod_name and prod_quantity, and finally at the last line you get rid of the original data.

So why you should not have a table with variable amount of columns? You will never know how many items you have in each kit. You will to fiddle to get values for those variable columns. It is even worse than having information inside the dictionary.

update

To get the result in exact way you've asked, just run following:

data = json.loads(d)
df = pd.json_normalize(data, record_path=['kits'], meta= ['transaction_no', 'is_delivered','flow', 'transaction_date', 'receiver_client', 'warehouse'] )
tmp = df['items'].apply(lambda it: [{'product'+str(indx+1):x['product']['short_code'], 'quantity'+str(indx+1):x['quantity']} for indx,x in enumerate(it)])
tmp = tmp.apply(lambda x :  {k:el[k] for el in x for k in el})
tmp = pd.DataFrame.from_records(tmp)
df = pd.concat([df, tmp], axis=1)
df = df.drop(columns=['items', 'id'])

On the data you put online the result is:

        kit  quantity transaction_no is_delivered flow             transaction_date                            receiver_client                                   warehouse product1  quantity1 product2  quantity2 product3  quantity3  product4  quantity4 product5  quantity5
0  KIT1182A  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001    5          PS002    5          PL001    5          FIN1182A  30         FSS001   18       
1  KIT1182B  5         1180           False        36   2020-08-13T04:34:11.678000Z  Lumax Cornaglia Auto Tech Private Limited  Yantraksh Logistics Private limited_GGNPC1  PP001    5          PS001    5          PL001    5          FIN1182B  20         FSS001   25       
2  KIT1151   14        1179           False        1    2020-08-11T04:31:31.245000Z  Mahindra & Mahindra_Kandivali              Yantraksh Logistics Private limited_GGNPC1  PP001    14         PS001    14         PL001    14         FIN1151A  28         FSS001   42       
3  KIT1151   15        1178           False        32   2020-08-10T04:30:12.022000Z  Mahindra Vehicle Manufacturers Pune        Yantraksh Logistics Private limited_GGNPC1  PP001    15         PS001    15         PL001    15         FIN1151A  29         FSS001   43

Flatten a nested dictionary and convert it into columns

Answers (2)

update

Related Questions