Reputation: 45
Input Table
index | income | Education | age1to_20 | pcd |
---|---|---|---|---|
1 | income_1 | Education_0 | 1 | A5009 |
2 | income_2 | Education_2 | 1 | A3450 |
3 | income_1 | Education_0 | 1 | A5009 |
4 | income_3 | Education_1 | 0 | A3450 |
How do I convert this table into
index | income_1 | income_2 | INCOME_3 | Education_0 | Education_1 | Education_2 | age1to_20 |
---|---|---|---|---|---|---|---|
1 | A5009 | 0 | 0 | A5009 | 0 | 0 | A5009 |
2 | 0 | A3450 | 0 | 0 | 0 | A3450 | A3450 |
3 | A5009 | 0 | 0 | A5009 | 0 | 0 | A5009 |
4 | 0 | 0 | A3450 | 0 | A3450 | 0 | 0 |
UPDATED THE OUTPUT TABLE
Upvotes: 1
Views: 125
Reputation: 13582
pandas.get_dummies
allows us to get almost the desired output, but instead of getting the values as the pcd
column, one has 0
or 1
for a given pcd
import pandas as pd
df2 = pd.get_dummies(df, columns=['income', 'Education'])
[Out]:
index age1to_20 ... Education_Education_1 Education_Education_2
0 1 1 ... 0 0
1 2 1 ... 0 1
2 3 1 ... 0 0
3 4 0 ... 1 0
Then, in order to change the 1
s with the respective pcd
value (and drop the column pcd
at the end), one can use pandas.Series.mask
and pandas.DataFrame.drop
as follows
cols = df2.columns.drop(['index', 'pcd']) # To select the columns one wants to apply the change
df2[cols] = df2[cols].mask(df2[cols] == 1, df2['pcd'], axis=0)
df2 = df2.drop(['pcd'], axis=1)
[Out]:
index age1to_20 ... Education_Education_1 Education_Education_2
0 1 A5009 ... 0 0
1 2 A3450 ... 0 A3450
2 3 A5009 ... 0 0
3 4 0 ... A3450 0
Notes:
An alternative to change the 1
s to the respective pcd
and drop the column pcd
would be to use pandas.DataFrame.assign
, numpy.where
and pandas.DataFrame.drop
as follows
import numpy as np
df2 = df2.assign(income_1 = np.where(df2['income_income_1'] == 1, df2['pcd'], 0),
income_2 = np.where(df2['income_income_2'] == 1, df2['pcd'], 0),
income_3 = np.where(df2['income_income_3'] == 1, df2['pcd'], 0),
Education_0 = np.where(df2['Education_Education_0'] == 1, df2['pcd'], 0),
Education_1 = np.where(df2['Education_Education_1'] == 1, df2['pcd'], 0),
Education_2 = np.where(df2['Education_Education_2'] == 1, df2['pcd'], 0),
Age1to20 = np.where(df2['age1to_20'] == 1, df2['pcd'], 0)).drop(['age1to_20', 'income_income_1', 'income_income_2', 'income_income_3', 'Education_Education_0', 'Education_Education_1', 'Education_Education_2', 'pcd'], axis=1)
[Out]:
index income_1 income_2 ... Education_1 Education_2 Age1to20
0 1 A5009 0 ... 0 0 A5009
1 2 0 A3450 ... 0 A3450 A3450
2 3 A5009 0 ... 0 0 A5009
3 4 0 0 ... A3450 0 0
There are additional ways to select the columns one wants to consider in a given dataframe. For that this might be useful: Selecting multiple columns in a Pandas dataframe
Upvotes: 0
Reputation: 260380
cols_to_pivot = ['income', 'Education']
cols = df.columns.difference(cols_to_pivot)
(df.melt(cols).drop(columns='variable')
.pivot(index=['index', 'age1to_20'], columns='value', values='pcd')
.fillna(0).reset_index()
)
output:
value index age1to_20 Education_0 Education_1 Education_2 income_1 income_2 income_3
0 1 1 A5009 0 0 A5009 0 0
1 2 1 0 0 A3450 0 A3450 0
2 3 1 A5009 0 0 A5009 0 0
3 4 0 0 A3450 0 0 0 A3450
cols_to_pivot = ['income', 'Education']
cols = df.columns.difference(cols_to_pivot)
(df.assign(age1to_20=df['pcd'].where(df['age1to_20'].eq(1), 0))
.melt(cols).drop(columns='variable')
.pivot(index=['index', 'age1to_20'], columns='value', values='pcd')
.fillna(0).reset_index()
)
output:
value index age1to_20 Education_0 Education_1 Education_2 income_1 income_2 income_3
0 1 A5009 A5009 0 0 A5009 0 0
1 2 A3450 0 0 A3450 0 A3450 0
2 3 A5009 A5009 0 0 A5009 0 0
3 4 0 0 A3450 0 0 0 A3450
Upvotes: 2
Reputation: 28644
One option is to pivot once, run a for loop across the dataframe to get the individual dataframes, and concatenate across the columns, to get the final output:
out = df.pivot(index=['index', 'age1to_20'],
columns=['income', 'Education'],
values = 'pcd')
out = [out.droplevel(axis=1,level=n)
for n in range(out.columns.nlevels)]
pd.concat(out, axis = 1).fillna(0, downcast='infer').reset_index()
index age1to_20 Education_0 Education_2 Education_1 income_1 income_2 income_3
0 1 1 A5009 0 0 A5009 0 0
1 2 1 0 A3450 0 0 A3450 0
2 3 1 A5009 0 0 A5009 0 0
3 4 0 0 0 A3450 0 0 A3450
Upvotes: 1
Reputation: 25323
Another possible solution:
(pd.concat([
df.pivot(index=['index', 'age1to_20'], columns=['income'], values='pcd'),
df.pivot(index=['index', 'age1to_20'], columns=['Education'], values='pcd')], axis=1)
.fillna(0).reset_index())
Output:
index age1to_20 income_1 income_2 income_3 Education_0 Education_1 Education_2
0 1 1 A5009 0 0 A5009 0 0
1 2 1 0 A3450 0 0 0 A3450
2 3 1 A5009 0 0 A5009 0 0
3 4 0 0 0 A3450 0 A3450 0
EDIT
In case there are a lot of columns to pivot, the following code does that, by iterating over the list of columns to pivot with map
:
cols = ['income', 'Education']
(pd.concat(
map(lambda x: df.pivot(
index=['index', 'age1to_20'], columns=x, values='pcd'), cols), axis=1)
.fillna(0).reset_index())
Upvotes: 4