Nick
Nick

Reputation: 3134

Pivoting a Pandas Dataframe on Categorical Variables

I have a dataframe containing categorical variables:

{'SysID': {0: '00721778',
1: '00721778',
2: '00721778',
3: '00721779',
4: '00721779'},
'SoftwareComponent': {0: 'AA13912',
1: 'AA24120',
2: 'AA21612',
3: 'AA30861',
4: 'AA20635'},
'SoftwareSubcomponent': {0: None,
1: 'AK21431',
2: None,
3: 'AK22116',
4: None}}

I would like to pivot on the categorical variables by ignoring any NULL values. Zero should be the filler. The output should look like this:

{'SysID': {0: '00721778', 1: '00721779'},
'SoftwareCom-AA13912': {0: '1', 1: '0'},
'SoftwareCom-AA24120': {0: '1', 1: '0'},
'SoftwareCom-AA21612': {0: '1', 1: '0'},
'SoftwareCom-AA30861': {0: '0', 1: '1'},
'SoftwareCom-AA20635': {0: '0', 1: '1'},
'SoftwareSub-AK21431': {0: '1', 1: '0'},
'SoftwareSub-AK22116': {0: '0', 1: '1'}}

How to do this?

Upvotes: 1

Views: 1530

Answers (2)

rahlf23
rahlf23

Reputation: 9019

You can use pd.crosstab() and then rename your dataframe columns prior to using pd.concat():

df1 = pd.crosstab(df['SysID'], df['SoftwareComponent'])
df1.columns = [df1.columns.name + '-' + i for i in df1.columns]
df2 = pd.crosstab(df['SysID'], df['SoftwareSubcomponent'])
df2.columns = [df2.columns.name + '-' + i for i in df2.columns]
final = pd.concat([df1, df2], axis=1)

Yields:

          SoftwareComponent-AA13912  SoftwareComponent-AA20635  \
SysID                                                            
00721778                          1                          0   
00721779                          0                          1   

          SoftwareComponent-AA21612  SoftwareComponent-AA24120  \
SysID                                                            
00721778                          1                          1   
00721779                          0                          0   

          SoftwareComponent-AA30861  SoftwareSubcomponent-AK21431  \
SysID                                                               
00721778                          0                             1   
00721779                          1                             0   

          SoftwareSubcomponent-AK22116  
SysID                                   
00721778                             0  
00721779                             1 

Using to_dict(), you can return:

{'SoftwareComponent-AA13912': {'00721778': 1, '00721779': 0}, 'SoftwareComponent-AA20635': {'00721778': 0, '00721779': 1}, 'SoftwareComponent-AA21612': {'00721778': 1, '00721779': 0}, 'SoftwareComponent-AA24120': {'00721778': 1, '00721779': 0}, 'SoftwareComponent-AA30861': {'00721778': 0, '00721779': 1}, 'SoftwareSubcomponent-AK21431': {'00721778': 1, '00721779': 0}, 'SoftwareSubcomponent-AK22116': {'00721778': 0, '00721779': 1}}

Upvotes: 1

ALollz
ALollz

Reputation: 59549

You can use pd.crosstab after doing a bit of cleanup. We will stack (which will ignore all of the None values) and create the column names as you want to treat SofwareCom and SoftwareSub the same.

import pandas as pd

df = df.set_index('SysID').stack().reset_index(level=1)
df['val'] = df['level_1'].str[0:11] + '-' + df[0]

pd.crosstab(df.index, df.val).rename_axis('SysID', 0).rename_axis(None,1).reset_index()

Output:

      SysID  SoftwareCom-AA13912  SoftwareCom-AA20635  SoftwareCom-AA21612  SoftwareCom-AA24120  SoftwareCom-AA30861  SoftwareSub-AK21431  SoftwareSub-AK22116
0  00721778                    1                    0                    1                    1                    0                    1                    0
1  00721779                    0                    1                    0                    0                    1                    0                    1

If you have the possibility of having multiple counts and just want 1s and 0s, then you can either typecast to bool, then back to int, or just use .clip

pd.crosstab(df.index, df.val).rename_axis('SysID', 0).rename_axis(None,1).clip(0,1).reset_index()

Upvotes: 1

Related Questions