Georg Heiler
Georg Heiler

Reputation: 17724

Pandas append multiple columns for a single one

How can I use pandas to append multiple KPI values per single customer efficiently?

A join of the pivoted df with the customers df makes some problems because the country is the index of the pivoted data frame and the nationality is not in the index.

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
                           'indicator':['z','x','z','x'],
                           'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
                           'nationality':['Germany','Austria'],
                           'value':[7,8]})

See the desired result in pink: enter image description here

Upvotes: 1

Views: 720

Answers (2)

Nickil Maveli
Nickil Maveli

Reputation: 29719

You could counter the mismatch in the categories through merge:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator'])
df.index.name = 'nationality'    
customers.merge(df['value'].reset_index(), on='nationality', how='outer')

Image

Data:

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
                           'indicator':['z','x','z','x'],
                           'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
                           'nationality':['Slovakia','Austria'],
                           'value':[7,8]})

The problem appears to be that you have got CategoricalIndex in your DF resulting from the pivot operation and when you perform reset_index on that complains you of that error.

Simply do reverse engineering as in check the dtypes of countryKPI and customers Dataframes and wherever there is category mentioned, convert those columns to their string representation via astype(str)


Reproducing the Error and Countering it:

Assume the DF to be the above mentioned:

countryKPI['indicator'] = countryKPI['indicator'].astype('category')
countryKPI['country'] = countryKPI['country'].astype('category')
customers['nationality'] = customers['nationality'].astype('category')

countryKPI.dtypes
country      category
indicator    category
value           int64
dtype: object

customers.dtypes
customer         object
nationality    category
value             int64
dtype: object

After pivot operation:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator'])
df.index
CategoricalIndex(['Austria', 'Germany'], categories=['Austria', 'Germany'], ordered=False, 
                  name='country', dtype='category')
# ^^ See the categorical index

When you perform reset_index on that:

df.reset_index()

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

To counter that error, simply cast the categorical columns to str type.

countryKPI['indicator'] = countryKPI['indicator'].astype('str')
countryKPI['country'] = countryKPI['country'].astype('str')
customers['nationality'] = customers['nationality'].astype('str')

Now, the reset_index part works and even the merge too.

Upvotes: 1

jezrael
jezrael

Reputation: 863531

I think you can use concat:

df_pivoted = countryKPI.pivot_table(index='country', 
                              columns='indicator', 
                              values='value', 
                              fill_value=0)
print (df_pivoted)    
indicator  x  z
country        
Austria    7  7
Germany    8  9

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1))
        customer  value  x  z
Austria   second      8  7  7
Germany    first      7  8  9                       


print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)
         .reset_index()
         .rename(columns={'index':'nationality'})
         [['customer','nationality','value','x','z']])

  customer nationality  value  x  z
0   second     Austria      8  7  7
1    first     Germany      7  8  9

EDIT by comments:

Problem is dtypes of columns customers.nationality and countryKPI.country is category and if some categories are missing, it raise error:

ValueError: incompatible categories in categorical concat

Solution find common categories by union and then set_categories:

import pandas as pd
import numpy as np

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
                           'indicator':['z','x','z','x'],
                           'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
                           'nationality':['Slovakia','Austria'],
                           'value':[7,8]})

customers.nationality = customers.nationality.astype('category')
countryKPI.country = countryKPI.country.astype('category')

print (countryKPI.country.cat.categories)
Index(['Austria', 'Germany'], dtype='object')

print (customers.nationality.cat.categories)
Index(['Austria', 'Slovakia'], dtype='object')

all_categories =countryKPI.country.cat.categories.union(customers.nationality.cat.categories)
print (all_categories)
Index(['Austria', 'Germany', 'Slovakia'], dtype='object')

customers.nationality = customers.nationality.cat.set_categories(all_categories)
countryKPI.country = countryKPI.country.cat.set_categories(all_categories)
df_pivoted = countryKPI.pivot_table(index='country', 
                              columns='indicator', 
                              values='value', 
                              fill_value=0)
print (df_pivoted)    
indicator  x  z
country        
Austria    7  7
Germany    8  9
Slovakia   0  0        

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)
         .reset_index()
         .rename(columns={'index':'nationality'})
         [['customer','nationality','value','x','z']])

  customer nationality  value  x  z
0   second     Austria    8.0  7  7
1      NaN     Germany    NaN  8  9
2    first    Slovakia    7.0  0  0

If need better performance, instead pivot_table use groupby:

df_pivoted1 = countryKPI.groupby(['country','indicator'])
                        .mean()
                        .squeeze()
                        .unstack()
                        .fillna(0)
print (df_pivoted1)
indicator    x    z
country            
Austria    7.0  7.0
Germany    8.0  9.0
Slovakia   0.0  0.0

Timings:

In [177]: %timeit countryKPI.pivot_table(index='country', columns='indicator', values='value', fill_value=0)
100 loops, best of 3: 6.24 ms per loop

In [178]: %timeit countryKPI.groupby(['country','indicator']).mean().squeeze().unstack().fillna(0)
100 loops, best of 3: 4.28 ms per loop

Upvotes: 2

Related Questions