Ciaran O Brien
Ciaran O Brien

Reputation: 384

Transform data to attribute:value pairs while maintaining ID

I have a dataset of customer information which has a unique ID per customer; an identity variable which considers their status such as retired, student, etc.; and finally a type variable which considers what type of shopper they are, e.g. weekly, monthly etc. Customers can have multiple identities and types, but only one ID.

From my dataset, I wish to correlate every identity with every type for the purposes of doing counts and heatmaps on this. As such, I need to break out and unpivot the data into an attribute:value pair format while maintaining the customer ID. I'm scratching my head on how to even approach this so am not sure what documentation to read, nor can provide any coding attempts.

Here is a reproducible example of my data and a sample of my expected output. Any advice is welcomed.

import pandas as pd
import numpy as np
    
data = {'ID':['1','1','1','2','2','2','3','3','3','3'],
                'Identity': ['Identity_1', 'Identity_2','Identity_3','Identity_1', 'Identity_4','Identity_2','Identity_4', 'Identity_5','Identity_6',np.nan],
                'Type': ['Type_1', 'Type_2',np.nan,'Type_3', 'Type_1',np.nan,'Type_4','Type_5','Type_1', 'Type_6']
                }
        
df_data = pd.DataFrame (data, columns = ['ID','Identity','Type'])
        
result ={'ID':['1','1','1','1','1','1'],
                'Identity': ['Identity_1', 'Identity_1','Identity_2','Identity_2', 'Identity_3','Identity_3'],
                'Type': ['Type_1', 'Type_2','Type_1', 'Type_2','Type_1', 'Type_2']
                }
        
df_result = pd.DataFrame (result, columns = ['ID','Identity','Type'])

Upvotes: 1

Views: 95

Answers (1)

Chris
Chris

Reputation: 29742

One way using itertools.product:

from itertools import product

def prod(data):
    return pd.DataFrame(list(product(data["Identity"].dropna(), 
                                     data["Type"].dropna())), 
                        columns=["Identity", "Type"])

new_df = df_data.groupby("ID").apply(prod).reset_index(level=1, drop=True)
print(new_df)

Output:

      Identity    Type
ID                    
1   Identity_1  Type_1
1   Identity_1  Type_2
1   Identity_2  Type_1
1   Identity_2  Type_2
1   Identity_3  Type_1
1   Identity_3  Type_2
2   Identity_1  Type_3
2   Identity_1  Type_1
2   Identity_4  Type_3
2   Identity_4  Type_1
2   Identity_2  Type_3
2   Identity_2  Type_1
3   Identity_4  Type_4
3   Identity_4  Type_5
3   Identity_4  Type_1
3   Identity_4  Type_6
3   Identity_5  Type_4
3   Identity_5  Type_5
3   Identity_5  Type_1
3   Identity_5  Type_6
3   Identity_6  Type_4
3   Identity_6  Type_5
3   Identity_6  Type_1
3   Identity_6  Type_6

Upvotes: 1

Related Questions