Reputation: 1190
I have a dataframe that simulates purchases, containing clients' first names, second names, the paid amount and the category of the item they bought.
df = pd.DataFrame({'FirstName':['Peter','John','Peter'],'SecondName':['Jordan','Jackson','Jordan'],'Paid Amount':[1,4,6],'ItemsCategory':[1,2,5]})
Previously in my code I repeated the same groupby several times, and I am trying to improve that.
The code should group by first and second name and calculate the following:
1. Get the total paid amount (a simple sum).
2. Generate a unique ID (first name + last name + a random int).
3. Map each purchase to a category (using the dict below); each category column is set to 1/0 depending on whether the client bought a product belonging to it.
FirstName SecondName Paid Amount ItemsCategory
0 Peter Jordan 1 1
1 John Jackson 4 2
2 Peter Jordan 6 5
I am using the following dictionary to map the item ids to their category:
product_type={'healthPr':[1],'makeUpPr':[3,2]}
The desired output would be:
FirstName SecondName ClientID TotalSpent healthPr makeUpPr
0 Peter Jordan PeterJordan-111 7 0 1
1 John Jackson JohnJackson-222 4 1 0
I got points 1 and 3 working with the following code:
df_group = df.groupby(['FirstName','SecondName'])

def get_total_paid(grouped_df):
    return sum(grouped_df['Paid Amount'])

def get_products_by_type(grouped_df):
    final_series = {}
    for k, v in product_type.items():
        cond_ = {k: ''}
        if any(x in v for x in grouped_df['ItemsCategory']):
            cond_[k] = 1
        else:
            cond_[k] = 0
        final_series[k] = cond_[k]
    return pd.Series(final_series)

df_group.apply(get_total_paid)
df_group.apply(get_products_by_type)
Point 2, getting the ID, is a bit more complex because some customers have more than one entry per group, so I have to do the aggregation myself:
def get_customer_id(grouped_df):
    print(grouped_df)
    return f"{grouped_df['FirstName']}{grouped_df['SecondName']}-{random.randint(0,100)}"
Ideally I would like to have something like this, so that I can then keep adding my aggregations:
grouped_df = df.groupby(['FirstName','SecondName'])
calc_df = pd.Series({'CustomerID': grouped_df.apply(get_customer_id),
                     'TotalSpent': grouped_df.apply(get_total_paid),
                     'CategoriesBought': grouped_df.apply(get_products_by_type)})
But it actually crashes because of the CustomerID. I want to avoid using a lambda on the series because I want to do some exception handling (e.g. when a name is an empty string).
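To make the style concrete, this is roughly the shape I have in mind for the ID aggregation, a named function with its own validation instead of a lambda (the empty-name check here is only an illustrative example):

import random

def get_customer_id(grouped_df):
    # Inside GroupBy.apply the group key tuple is exposed as .name
    first_name, second_name = grouped_df.name
    if not first_name or not second_name:
        raise ValueError('Customer name is missing')
    return f"{first_name}{second_name}-{random.randint(0, 100)}"

client_ids = df.groupby(['FirstName', 'SecondName']).apply(get_customer_id)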
Upvotes: 0
Views: 43
Reputation: 35636
I'd recommend a different approach.
Create a mapper to map values in the ItemsCategory column from number to category, and use pd.get_dummies to encode them into new columns:
# Replace Categories With Mapper
mapper = {v: k for k, lst in product_type.items() for v in lst}
df['ItemsCategory'] = df['ItemsCategory'].map(mapper)
# Expand Out Into Dummies (NaN will be dropped)
df = pd.get_dummies(df, columns=['ItemsCategory'], prefix_sep='', prefix='')
df:
FirstName SecondName Paid Amount healthPr makeUpPr
0 Peter Jordan 1 1 0
1 John Jackson 4 0 1
2 Peter Jordan 6 0 0
Then, to get totals, a groupby sum works:
# Groupby columns
g = df.groupby(['FirstName', 'SecondName'], as_index=False, sort=False)
df = g.sum()
For the ClientID, I'd recommend using np.random.randint, which can generate an array that can be concatenated with the First and Second names:
# Add ClientID column using randint instead
df.insert(2, 'ClientID', df['FirstName'] +
          df['SecondName'] + '-' +
          np.random.randint(0, 100, len(df)).astype(str))
df:
FirstName SecondName ClientID Paid Amount healthPr makeUpPr
0 Peter Jordan PeterJordan-81 7 1 0
1 John Jackson JohnJackson-45 4 0 1
Together:
import numpy as np
import pandas as pd

df = pd.DataFrame({'FirstName': ['Peter', 'John', 'Peter'],
                   'SecondName': ['Jordan', 'Jackson', 'Jordan'],
                   'Paid Amount': [1, 4, 6], 'ItemsCategory': [1, 2, 5]})
product_type = {'healthPr': [1], 'makeUpPr': [3, 2]}

# Replace Categories With Mapper
mapper = {v: k for k, lst in product_type.items() for v in lst}
df['ItemsCategory'] = df['ItemsCategory'].map(mapper)

# Expand Out Into Dummies (NaN will be dropped)
df = pd.get_dummies(df, columns=['ItemsCategory'], prefix_sep='', prefix='')

# Groupby columns
g = df.groupby(['FirstName', 'SecondName'], as_index=False, sort=False)
df = g.sum()

# Add ClientID column using np.random.randint instead
df.insert(2, 'ClientID', df['FirstName'] +
          df['SecondName'] + '-' +
          np.random.randint(0, 100, len(df)).astype(str))

print(df)
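As a small optional extra (not something the question asked for): if the random suffix in the ClientID needs to be reproducible across runs, a seeded NumPy generator can be used for the suffixes instead of the module-level np.random.randint call, for example:

import numpy as np

# Seeded generator, so the ID suffixes come out the same on every run
rng = np.random.default_rng(42)
df['ClientID'] = (df['FirstName'] + df['SecondName'] + '-' +
                  rng.integers(0, 100, len(df)).astype(str))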
Upvotes: 1