Vivek
Vivek

Reputation: 21

convert dataframe to a nested dictionary without column names

This is a sample of my pandas dataframe and it contains close to 100k rows

import pandas as pd
df = pd.DataFrame({'cluster': ['5', '5', '5', '5', '5', '5'],
         'mdse_item_i': ['23627102',
                         '23627102',
                         '23627102',
                         '23627102',
                         '23627102',
                         '23627102'],
         'predPriceQty': ['35.675543',
                         '33.236678',
                         '35.675543',
                         '35.675543',
                         '35.675543',
                         '35.675543'],
         'schedule_i': ['56', '56', '56', '56', '56', '56'],
         'segment_id': ['4123', '4123', '4144', '4161', '4295', '4454'],
         'wk': ['1', '2', '1', '1', '1', '1']} )
segment_id cluster schedule_i mdse_item_i wk predPriceQty
4123 5 56 23627102 1 35.675543
4123 5 56 23627102 2 33.236678
4144 5 56 23627102 1 35.675543
4161 5 56 23627102 1 35.675543
4295 5 56 23627102 1 35.675543
4454 5 56 23627102 1 35.675543

Below is the nested format of the dictionary i'm looking to achieve

{(4123, 5): {56.0: {23627102.0: {1: 35.6755430505491, 2:33.236678}}},
 (4144, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
 (4161, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
 (4295, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
 (4454, 5): {56.0: {23627102.0: {1: 35.6755430505491}}}}

The code below works for me but for huge dataframes, it takes hours to create the dictionary and i'm trying to avoid row-wise iteration

forecast_dict_all = {}
for _, row in df.iterrows():
        item_agg_id = int(row[segment_id])
        mdse_item_i = row["mdse_item_i"]
        cluster = int(row["cluster"])
        wk = int(row["wk"])
        forecast = float(row["predPriceQty"])
        schedule_id = row["schedule_i"]
        
        if (item_agg_id, cluster) not in forecast_dict_all:
            forecast_dict_all[item_agg_id, cluster] = {
                schedule_id: {mdse_item_i: {wk: forecast}}
            }

My solution so far

dict(df.groupby(['segment_id','cluster'],as_index=False).apply(lambda x: x.to_dict()).to_dict())
df.set_index(['segment_id', 'cluster'], inplace=True)
    
di = df.to_dict(orient='index')
    
forecast_dict_all = {k:{v['schedule_i']: {v['mdse_item_i']: {v['wk']: v['predPriceQty']}}} 
                            for k,v in di.items()}
df.set_index(['segment_id', 'cluster'], inplace=True)
{k:{grp['schedule_i']: {grp['mdse_item_i']: {grp['wk']: grp['predPriceQty']}}}
for k, grp in df.groupby(['schedule_i','mdse_item_i','wk','predPriceQty'])}

I even tried using zipping but in both the instances, i could not achieve the desired output.

Edit I am using python: 2.7.13.final.0 pandas: 0.20.1

Any help is appreciated, thanks

Upvotes: 2

Views: 386

Answers (1)

Steven Rumbalski
Steven Rumbalski

Reputation: 45552

I have no idea whether this will be faster, but it gives the expected output with your example data.

df = pd.DataFrame(d)
df = df.astype(dtype={'cluster': int, 'mdse_item_i': int, 'predPriceQty': float,
                'schedule_i': int, 'segment_id': int, 'wk': int})
df.drop_duplicates(inplace=True)
df.set_index(['segment_id', 'cluster'], inplace=True)
answer = df.apply(lambda row:
                {row['schedule_i']: {row['mdse_item_i']: {row['wk']: row['predPriceQty']}}},
                axis=1).to_dict()

Result:

{(4123, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
 (4144, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
 (4161, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
 (4295, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
 (4454, 5): {56.0: {23627102.0: {1.0: 35.675543}}}}

Note: I fixed the dataframe's types because you do so in your code, but the best time to get the types right is when creating the dataframe.

Upvotes: 3

Related Questions