Reputation: 21
This is a sample of my pandas dataframe and it contains close to 100k rows
import pandas as pd
df = pd.DataFrame({'cluster': ['5', '5', '5', '5', '5', '5'],
'mdse_item_i': ['23627102',
'23627102',
'23627102',
'23627102',
'23627102',
'23627102'],
'predPriceQty': ['35.675543',
'33.236678',
'35.675543',
'35.675543',
'35.675543',
'35.675543'],
'schedule_i': ['56', '56', '56', '56', '56', '56'],
'segment_id': ['4123', '4123', '4144', '4161', '4295', '4454'],
'wk': ['1', '2', '1', '1', '1', '1']} )
segment_id | cluster | schedule_i | mdse_item_i | wk | predPriceQty |
---|---|---|---|---|---|
4123 | 5 | 56 | 23627102 | 1 | 35.675543 |
4123 | 5 | 56 | 23627102 | 2 | 33.236678 |
4144 | 5 | 56 | 23627102 | 1 | 35.675543 |
4161 | 5 | 56 | 23627102 | 1 | 35.675543 |
4295 | 5 | 56 | 23627102 | 1 | 35.675543 |
4454 | 5 | 56 | 23627102 | 1 | 35.675543 |
Below is the nested format of the dictionary i'm looking to achieve
{(4123, 5): {56.0: {23627102.0: {1: 35.6755430505491, 2:33.236678}}},
(4144, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
(4161, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
(4295, 5): {56.0: {23627102.0: {1: 35.6755430505491}}},
(4454, 5): {56.0: {23627102.0: {1: 35.6755430505491}}}}
The code below works for me but for huge dataframes, it takes hours to create the dictionary and i'm trying to avoid row-wise iteration
forecast_dict_all = {}
for _, row in df.iterrows():
item_agg_id = int(row[segment_id])
mdse_item_i = row["mdse_item_i"]
cluster = int(row["cluster"])
wk = int(row["wk"])
forecast = float(row["predPriceQty"])
schedule_id = row["schedule_i"]
if (item_agg_id, cluster) not in forecast_dict_all:
forecast_dict_all[item_agg_id, cluster] = {
schedule_id: {mdse_item_i: {wk: forecast}}
}
My solution so far
dict(df.groupby(['segment_id','cluster'],as_index=False).apply(lambda x: x.to_dict()).to_dict())
df.set_index(['segment_id', 'cluster'], inplace=True)
di = df.to_dict(orient='index')
forecast_dict_all = {k:{v['schedule_i']: {v['mdse_item_i']: {v['wk']: v['predPriceQty']}}}
for k,v in di.items()}
df.set_index(['segment_id', 'cluster'], inplace=True)
{k:{grp['schedule_i']: {grp['mdse_item_i']: {grp['wk']: grp['predPriceQty']}}}
for k, grp in df.groupby(['schedule_i','mdse_item_i','wk','predPriceQty'])}
I even tried using zipping but in both the instances, i could not achieve the desired output.
Edit I am using python: 2.7.13.final.0 pandas: 0.20.1
Any help is appreciated, thanks
Upvotes: 2
Views: 386
Reputation: 45552
I have no idea whether this will be faster, but it gives the expected output with your example data.
df = pd.DataFrame(d)
df = df.astype(dtype={'cluster': int, 'mdse_item_i': int, 'predPriceQty': float,
'schedule_i': int, 'segment_id': int, 'wk': int})
df.drop_duplicates(inplace=True)
df.set_index(['segment_id', 'cluster'], inplace=True)
answer = df.apply(lambda row:
{row['schedule_i']: {row['mdse_item_i']: {row['wk']: row['predPriceQty']}}},
axis=1).to_dict()
Result:
{(4123, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4144, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4161, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4295, 5): {56.0: {23627102.0: {1.0: 35.675543}}},
(4454, 5): {56.0: {23627102.0: {1.0: 35.675543}}}}
Note: I fixed the dataframe's types because you do so in your code, but the best time to get the types right is when creating the dataframe.
Upvotes: 3