Reputation: 1190
I have a dataframe that simulates purchases, containing clients' first names, second names, the paid amount and the category of the item they bought.
df = pd.DataFrame({'FirstName':['Peter','John','Peter'],'SecondName':['Jordan','Jackson','Jordan'],'Paid Amount':[1,4,6],'ItemsCategory':[1,2,5]})
Previously in my code I repeated the same groupby several times, and I am trying to improve that.
The code should group by first and second name and calculate the following:
1. Get the total paid amount (a simple sum).
2. Generate a unique ID (first name + last name + a random int).
3. Map each purchase to a category (using the dict below); each category column is set to 1/0 depending on whether the client bought a product belonging to it.
FirstName SecondName Paid Amount ItemsCategory
0 Peter Jordan 1 1
1 John Jackson 4 2
2 Peter Jordan 6 5
I am using the following dictionary to map the item ids to their category:
product_type={'healthPr':[1],'makeUpPr':[3,2]}
The desired output would be:
FirstName SecondName ClientID TotalSpent healthPr makeUpPr
0 Peter Jordan PeterJordan-111 7 0 1
1 John Jackson JohnJackson-222 4 1 0
I got points 1 and 3 working with the following code:
df_group = df.groupby(['FirstName','SecondName'])

def get_total_paid(grouped_df):
    return sum(grouped_df['Paid Amount'])

def get_products_by_type(grouped_df):
    final_series = {}
    for k, v in product_type.items():
        cond_ = {k: ''}
        if any(x in v for x in grouped_df['ItemsCategory']):
            cond_[k] = 1
        else:
            cond_[k] = 0
        final_series[k] = cond_[k]
    return pd.Series(final_series)

df_group.apply(get_total_paid)
df_group.apply(get_products_by_type)
Point 2, getting the ID, is a bit more complex because some customers have more than one entry per group, so I have to do the aggregation myself:
def get_customer_id(grouped_df):
    print(grouped_df)
    return f"{grouped_df['FirstName']}{grouped_df['SecondName']}-{random.randint(0,100)}"
Ideally I would like to have something like this, so that I can then keep adding my aggregations:
grouped_df = df.groupby(['FirstName','SecondName'])
calc_df = pd.Series({'CustomerID': grouped_df.apply(get_customer_id),
                     'TotalSpent': grouped_df.apply(get_total_paid),
                     'CategoriesBought': grouped_df.apply(get_products_by_type)})
But it actually crashes because of the CustomerID. I want to avoid using a lambda on the series because I want to do some exception handling (e.g. when a name is an empty string).
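To make the style concrete, this is roughly the shape I have in mind for the ID aggregation, a named function with its own validation instead of a lambda (the empty-name check here is only an illustrative example):

import random

def get_customer_id(grouped_df):
    # Inside GroupBy.apply the group key tuple is exposed as .name
    first_name, second_name = grouped_df.name
    if not first_name or not second_name:
        raise ValueError('Customer name is missing')
    return f"{first_name}{second_name}-{random.randint(0, 100)}"

client_ids = df.groupby(['FirstName', 'SecondName']).apply(get_customer_id)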
Upvotes: 0
Views: 43
Reputation: 35636
I'd recommend a different approach.
Create a mapper to map values in the ItemsCategory column from number to category, and use pd.get_dummies to encode them into new columns:
# Replace Categories With Mapper
mapper = {v: k for k, lst in product_type.items() for v in lst}
df['ItemsCategory'] = df['ItemsCategory'].map(mapper)
# Expand Out Into Dummies (NaN will be dropped)
df = pd.get_dummies(df, columns=['ItemsCategory'], prefix_sep='', prefix='')
df:
FirstName SecondName Paid Amount healthPr makeUpPr
0 Peter Jordan 1 1 0
1 John Jackson 4 0 1
2 Peter Jordan 6 0 0
Then, to get totals, a groupby sum works:
# Groupby columns
g = df.groupby(['FirstName', 'SecondName'], as_index=False, sort=False)
df = g.sum()
For the ClientID, I'd recommend using np.random.randint, which can generate an array that can be concatenated with the First and Second names:
# Add ClientID column using randint instead
df.insert(2, 'ClientID', df['FirstName'] +
          df['SecondName'] + '-' +
          np.random.randint(0, 100, len(df)).astype(str))
df:
FirstName SecondName ClientID Paid Amount healthPr makeUpPr
0 Peter Jordan PeterJordan-81 7 1 0
1 John Jackson JohnJackson-45 4 0 1
Together:
import numpy as np
import pandas as pd

df = pd.DataFrame({'FirstName': ['Peter', 'John', 'Peter'],
                   'SecondName': ['Jordan', 'Jackson', 'Jordan'],
                   'Paid Amount': [1, 4, 6], 'ItemsCategory': [1, 2, 5]})
product_type = {'healthPr': [1], 'makeUpPr': [3, 2]}

# Replace Categories With Mapper
mapper = {v: k for k, lst in product_type.items() for v in lst}
df['ItemsCategory'] = df['ItemsCategory'].map(mapper)

# Expand Out Into Dummies (NaN will be dropped)
df = pd.get_dummies(df, columns=['ItemsCategory'], prefix_sep='', prefix='')

# Groupby columns
g = df.groupby(['FirstName', 'SecondName'], as_index=False, sort=False)
df = g.sum()

# Add ClientID column using np.random.randint instead
df.insert(2, 'ClientID', df['FirstName'] +
          df['SecondName'] + '-' +
          np.random.randint(0, 100, len(df)).astype(str))

print(df)
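As a small optional extra (not something the question asked for): if the random suffix in the ClientID needs to be reproducible across runs, a seeded NumPy generator can be used for the suffixes instead of the module-level np.random.randint call, for example:

import numpy as np

# Seeded generator, so the ID suffixes come out the same on every run
rng = np.random.default_rng(42)
df['ClientID'] = (df['FirstName'] + df['SecondName'] + '-' +
                  rng.integers(0, 100, len(df)).astype(str))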
Upvotes: 1