Reputation: 3134
I have a dataframe containing categorical variables:
{'SysID': {0: '00721778',
1: '00721778',
2: '00721778',
3: '00721779',
4: '00721779'},
'SoftwareComponent': {0: 'AA13912',
1: 'AA24120',
2: 'AA21612',
3: 'AA30861',
4: 'AA20635'},
'SoftwareSubcomponent': {0: None,
1: 'AK21431',
2: None,
3: 'AK22116',
4: None}}
I would like to pivot on the categorical variables by ignoring any NULL values. Zero should be the filler. The output should look like this:
{'SysID': {0: '00721778', 1: '00721779'},
'SoftwareCom-AA13912': {0: '1', 1: '0'},
'SoftwareCom-AA24120': {0: '1', 1: '0'},
'SoftwareCom-AA21612': {0: '1', 1: '0'},
'SoftwareCom-AA30861': {0: '0', 1: '1'},
'SoftwareCom-AA20635': {0: '0', 1: '1'},
'SoftwareSub-AK21431': {0: '1', 1: '0'},
'SoftwareSub-AK22116': {0: '0', 1: '1'}}
How to do this?
Upvotes: 1
Views: 1530
Reputation: 9019
You can use pd.crosstab()
and then rename your dataframe columns prior to using pd.concat()
:
df1 = pd.crosstab(df['SysID'], df['SoftwareComponent'])
df1.columns = [df1.columns.name + '-' + i for i in df1.columns]
df2 = pd.crosstab(df['SysID'], df['SoftwareSubcomponent'])
df2.columns = [df2.columns.name + '-' + i for i in df2.columns]
final = pd.concat([df1, df2], axis=1)
Yields:
SoftwareComponent-AA13912 SoftwareComponent-AA20635 \
SysID
00721778 1 0
00721779 0 1
SoftwareComponent-AA21612 SoftwareComponent-AA24120 \
SysID
00721778 1 1
00721779 0 0
SoftwareComponent-AA30861 SoftwareSubcomponent-AK21431 \
SysID
00721778 0 1
00721779 1 0
SoftwareSubcomponent-AK22116
SysID
00721778 0
00721779 1
Using to_dict()
, you can return:
{'SoftwareComponent-AA13912': {'00721778': 1, '00721779': 0}, 'SoftwareComponent-AA20635': {'00721778': 0, '00721779': 1}, 'SoftwareComponent-AA21612': {'00721778': 1, '00721779': 0}, 'SoftwareComponent-AA24120': {'00721778': 1, '00721779': 0}, 'SoftwareComponent-AA30861': {'00721778': 0, '00721779': 1}, 'SoftwareSubcomponent-AK21431': {'00721778': 1, '00721779': 0}, 'SoftwareSubcomponent-AK22116': {'00721778': 0, '00721779': 1}}
Upvotes: 1
Reputation: 59549
You can use pd.crosstab
after doing a bit of cleanup. We will stack (which will ignore all of the None
values) and create the column names as you want to treat SofwareCom and SoftwareSub the same.
import pandas as pd
df = df.set_index('SysID').stack().reset_index(level=1)
df['val'] = df['level_1'].str[0:11] + '-' + df[0]
pd.crosstab(df.index, df.val).rename_axis('SysID', 0).rename_axis(None,1).reset_index()
SysID SoftwareCom-AA13912 SoftwareCom-AA20635 SoftwareCom-AA21612 SoftwareCom-AA24120 SoftwareCom-AA30861 SoftwareSub-AK21431 SoftwareSub-AK22116
0 00721778 1 0 1 1 0 1 0
1 00721779 0 1 0 0 1 0 1
If you have the possibility of having multiple counts and just want 1s and 0s, then you can either typecast to bool, then back to int, or just use .clip
pd.crosstab(df.index, df.val).rename_axis('SysID', 0).rename_axis(None,1).clip(0,1).reset_index()
Upvotes: 1