user319436
user319436

Reputation: 173

How to join Multi-level dataframe on values in single-level dataframe

what I have so far is a normal transactional dataframe with the following columns:

store | item | year | month | day | sales

'year' can be 2015, 2016, 2017.

With that I created a summary dataframe:

store_item_years = df.groupby(
['store','item','year'])['sales'].agg(
[np.sum, np.mean, np.std, np.median, np.min, np.max]).unstack(
fill_value=0)

The last one results in a Multi-Index with 2 levels, like this:

               sum                  mean
        year | 2015 | 2016 | 2017 | 2015 | 2016 | 2017 | ... 
store | item   sum1    ...   ...    mean1  mean2  ...  | ...    

Now I'd like to merge the summary table back onto the transactional one:

store | item | year | month | day | sales | + | sum+'by'+year | mean+'by'+year
               2015                              sum1              mean1
               2016                              sum2              mean2
               2017                              ...                ...

I am trying to merge with the following:

df = pd.merge(df, store_item_years, 
             left_on=['store', 'item', 'year'], 
             right_on=['store', 'item', 'year'],
             how='left')

which results in the following error:

KeyError: 'year'

Any ideas? I am just getting my head around groupby. I haven't looked into PivotTable yet.

Please keep in mind that the problem is simplified. The number of store_item combinations is 200+K and other groupbys having 300+ columns. But always the same principle.

Thanks a lot in advance.

Upvotes: 2

Views: 1452

Answers (2)

user319436
user319436

Reputation: 173

found the culprit. Removed .unstack().

store_item_years = df.groupby(
   ['store','item','year'])['sales'].agg(
   [np.sum, np.mean, np.std, np.median, np.min, np.max])

The following to keep the context:

store_item_years.columns = store_item_years.columns+'_by_year'

And merge like this:

pd.merge(df, store_item_years.reset_index(), 
         left_on=['store', 'item', 'year'], 
         right_on=['store', 'item', 'year'],
         how='left')

Upvotes: 0

jezrael
jezrael

Reputation: 863226

I think you need first remove unstack and then use join for left join:

store_item_years = df.groupby(
['store','item','year'])['sales'].agg(
[np.sum, np.mean, np.std, np.median, np.min, np.max])

df = df.join(store_item_years, on=['store','item','year'])

Upvotes: 2

Related Questions