Mohseen Mulla
Mohseen Mulla

Reputation: 592

How to create the given jSON format from a pandas dataframe?

The data looks like this :

enter image description here

The expected Json fomat is like this

    {
    "DataExtractName": "SalesDataExtract",
    "BusinessName" : {
        "InvoiceDate": {
            "SourceSystem": {
                "MYSQL" : "Invc_Dt",
                "CSV" : "Invc_Date"
            },
            "DataType": {
                "MYSQL" : "varchar",
                "CSV" : "string"
            }
        },
        "Description": {
            "SourceSystem": {
                "MYSQL" : "Prod_Desc",
                "CSV" : "Prod_Descr"
            },
            "DataType": {
                "MYSQL" : "varchar",
                "CSV" : "string"
            }
        }
    }
},
{
    "DataExtractName": "DateDataExtract",
    "BusinessName" : {
        "InvoiceDate": {
            "SourceSystem": {
                "MYSQL" : "Date"
            },
            "DataType": {
                "MYSQL" : "varchar"
            }
        }
    }
}

How do i achieve this using python dataframes? Or do i need to write some script to make the data like this?

Note

I've tried using -

  1. df.to_json
  2. df.to_dict

Upvotes: 1

Views: 577

Answers (1)

above_c_level
above_c_level

Reputation: 3929

With so many nested structures, you should use marshmallow. It is built with your use case in mind. Please check out the excellent documentation: https://marshmallow.readthedocs.io/en/stable/ . All you need is the masic usage.

It is a lot of code, but better be explicit than clever. I am sure a shorter solution exists, but it is probably unmaintainable. Also I had to build your dataframe. Please provide it in a data format next time.

import pandas as pd
import marshmallow as ma

# build test data
df = pd.DataFrame.from_records([
                               ['InvoiceDate', 'MYSQL', 'Invc_Dt', 'varchar', 'SalesDataExtract'],
                               ['InvoiceDate', 'CSV', 'Invc_Date', 'string', 'SalesDataExtract'], 
                               ['Description', 'MYSQL', 'Prod_Descr', 'varchar', 'SalesDataExtract'],
                               ['Description', 'CSV', 'Prod_Descr', 'string', 'SalesDataExtract'],
                               ['InvoiceDate', 'MYSQL', 'Date', 'varchar', 'DateDataExtract'],]
                        )
df.columns = ['BusinessName', 'SourceSystem', 'FunctionalName', 'DataType', 'DataExtractName']


# define marshmallow schemas
class SourceSystemTypeSchema(ma.Schema):
    MYSQL = ma.fields.String()
    CSV = ma.fields.String()

class DataTypeSchema(ma.Schema):
    MYSQL = ma.fields.String()
    CSV = ma.fields.String()

class InvoiceDateSchema(ma.Schema):
    InvoiceDate = ma.fields.Nested(SourceSystemTypeSchema())
    DataType = ma.fields.Nested(DataTypeSchema())

class DescriptionSchema(ma.Schema):
    SourceSystem = ma.fields.Nested(SourceSystemTypeSchema())
    DataType = ma.fields.Nested(DataTypeSchema())

class BusinessNameSchema(ma.Schema):
    InvoiceDate = ma.fields.Nested(InvoiceDateSchema())
    Description = ma.fields.Nested(DescriptionSchema())

class DataSchema(ma.Schema):
    DataExtractName = ma.fields.String()
    BusinessName = ma.fields.Nested(BusinessNameSchema())

# building json
result = []

mask_business_name_invoicedate = df.BusinessName == 'InvoiceDate'
mask_business_name_description = df.BusinessName == 'Description'

for data_extract_name in set(df['DataExtractName'].to_list()):
    mask_data_extract_name = df.DataExtractName == data_extract_name

    # you need these two helper dfs to get the dictionaries 
    df_source_system = df[mask_data_extract_name & mask_business_name_invoicedate].set_index('SourceSystem').to_dict(orient='dict')
    df_description = df[mask_data_extract_name & mask_business_name_description].set_index('SourceSystem').to_dict(orient='dict')

    # all dictionaries are defined, so you can use your schemas
    source_system_type = SourceSystemTypeSchema().dump(df_source_system['FunctionalName'])
    data_type = DataTypeSchema().dump(df_source_system['DataType'])
    source_system = SourceSystemTypeSchema().dump(df_description['FunctionalName'])
    invoice_date = InvoiceDateSchema().dump({'SourceSystemType': source_system_type, 'DataType': data_type})
    description = DescriptionSchema().dump({'SourceSystem': source_system, 'DataType': data_type})
    business_name = BusinessNameSchema().dump({'InvoiceDate': invoice_date, 'Description': description})
    data = DataSchema().dump({'DataExtractName': data_extract_name, 'BusinessName': business_name})

    # end result
    result.append(data)

Now,

ma.pprint(result)

returns

[{'BusinessName': {'Description': {'DataType': {'CSV': 'string',
                                                'MYSQL': 'varchar'},
                                   'SourceSystem': {'CSV': 'Prod_Descr',
                                                    'MYSQL': 'Prod_Descr'}},
                   'InvoiceDate': {'DataType': {'CSV': 'string',
                                                'MYSQL': 'varchar'}}},
  'DataExtractName': 'SalesDataExtract'},
 {'BusinessName': {'Description': {'DataType': {'MYSQL': 'varchar'},
                                   'SourceSystem': {}},
                   'InvoiceDate': {'DataType': {'MYSQL': 'varchar'}}},
  'DataExtractName': 'DateDataExtract'}]

Upvotes: 2

Related Questions