Reputation: 17724
How can I use pandas to append multiple KPI values per single customer efficiently?
A join of the pivoted
df with the customers
df makes some problems because the country is the index of the pivoted data frame and the nationality is not in the index.
countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
'indicator':['z','x','z','x'],
'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
'nationality':['Germany','Austria'],
'value':[7,8]})
See the desired result in pink:
Upvotes: 1
Views: 720
Reputation: 29719
You could counter the mismatch in the categories through merge
:
df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator'])
df.index.name = 'nationality'
customers.merge(df['value'].reset_index(), on='nationality', how='outer')
Data:
countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
'indicator':['z','x','z','x'],
'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
'nationality':['Slovakia','Austria'],
'value':[7,8]})
The problem appears to be that you have got CategoricalIndex
in your DF
resulting from the pivot
operation and when you perform reset_index
on that complains you of that error.
Simply do reverse engineering as in check the dtypes
of countryKPI
and customers
Dataframes and wherever there is category
mentioned, convert those columns to their string
representation via astype(str)
Reproducing the Error and Countering it:
Assume the DF
to be the above mentioned:
countryKPI['indicator'] = countryKPI['indicator'].astype('category')
countryKPI['country'] = countryKPI['country'].astype('category')
customers['nationality'] = customers['nationality'].astype('category')
countryKPI.dtypes
country category
indicator category
value int64
dtype: object
customers.dtypes
customer object
nationality category
value int64
dtype: object
After pivot
operation:
df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator'])
df.index
CategoricalIndex(['Austria', 'Germany'], categories=['Austria', 'Germany'], ordered=False,
name='country', dtype='category')
# ^^ See the categorical index
When you perform reset_index
on that:
df.reset_index()
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
To counter that error, simply cast the categorical columns to str
type.
countryKPI['indicator'] = countryKPI['indicator'].astype('str')
countryKPI['country'] = countryKPI['country'].astype('str')
customers['nationality'] = customers['nationality'].astype('str')
Now, the reset_index
part works and even the merge
too.
Upvotes: 1
Reputation: 863531
I think you can use concat
:
df_pivoted = countryKPI.pivot_table(index='country',
columns='indicator',
values='value',
fill_value=0)
print (df_pivoted)
indicator x z
country
Austria 7 7
Germany 8 9
print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1))
customer value x z
Austria second 8 7 7
Germany first 7 8 9
print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)
.reset_index()
.rename(columns={'index':'nationality'})
[['customer','nationality','value','x','z']])
customer nationality value x z
0 second Austria 8 7 7
1 first Germany 7 8 9
EDIT by comments:
Problem is dtypes
of columns customers.nationality
and countryKPI.country
is category
and if some categories are missing, it raise error:
ValueError: incompatible categories in categorical concat
Solution find common categories by union
and then set_categories:
import pandas as pd
import numpy as np
countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'],
'indicator':['z','x','z','x'],
'value':[7,8,9,7]})
customers = pd.DataFrame({'customer':['first','second'],
'nationality':['Slovakia','Austria'],
'value':[7,8]})
customers.nationality = customers.nationality.astype('category')
countryKPI.country = countryKPI.country.astype('category')
print (countryKPI.country.cat.categories)
Index(['Austria', 'Germany'], dtype='object')
print (customers.nationality.cat.categories)
Index(['Austria', 'Slovakia'], dtype='object')
all_categories =countryKPI.country.cat.categories.union(customers.nationality.cat.categories)
print (all_categories)
Index(['Austria', 'Germany', 'Slovakia'], dtype='object')
customers.nationality = customers.nationality.cat.set_categories(all_categories)
countryKPI.country = countryKPI.country.cat.set_categories(all_categories)
df_pivoted = countryKPI.pivot_table(index='country',
columns='indicator',
values='value',
fill_value=0)
print (df_pivoted)
indicator x z
country
Austria 7 7
Germany 8 9
Slovakia 0 0
print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)
.reset_index()
.rename(columns={'index':'nationality'})
[['customer','nationality','value','x','z']])
customer nationality value x z
0 second Austria 8.0 7 7
1 NaN Germany NaN 8 9
2 first Slovakia 7.0 0 0
If need better performance, instead pivot_table
use groupby
:
df_pivoted1 = countryKPI.groupby(['country','indicator'])
.mean()
.squeeze()
.unstack()
.fillna(0)
print (df_pivoted1)
indicator x z
country
Austria 7.0 7.0
Germany 8.0 9.0
Slovakia 0.0 0.0
Timings:
In [177]: %timeit countryKPI.pivot_table(index='country', columns='indicator', values='value', fill_value=0)
100 loops, best of 3: 6.24 ms per loop
In [178]: %timeit countryKPI.groupby(['country','indicator']).mean().squeeze().unstack().fillna(0)
100 loops, best of 3: 4.28 ms per loop
Upvotes: 2