Reputation: 25
I have the following dataframe with the following format:
df = pd.DataFrame({'Date': np.repeat(['20190101', '20190102'], 8),
'Category': list(np.repeat(['A', 'B'], 4))*2,
'Sub-Category': list(np.repeat(['X', 'Y'], 2))*4,
'Samples': range(16)})
# print(df)
Date Category Sub-Category Samples
0 20190101 A X 0
1 20190101 A X 1
2 20190101 A Y 2
3 20190101 A Y 3
4 20190101 B X 4
5 20190101 B X 5
6 20190101 B Y 6
7 20190101 B Y 7
8 20190102 A X 8
9 20190102 A X 9
10 20190102 A Y 10
11 20190102 A Y 11
12 20190102 B X 12
13 20190102 B X 13
14 20190102 B Y 14
15 20190102 B Y 15
Within every date, there is a category and within each category there are multiple sub-categories. Each sub-category, moreover, has multiple samples.
What I want to do is to convert the 'Sub-Category' column from a long format to a wide format without doing any aggregation across the samples like this:
desired_df = pd.DataFrame({'Date': np.repeat(['20190101', '20190102'], 4),
'Category': list(np.repeat(['A', 'B'], 2))*2,
'X': [0, 1, 4, 5, 8, 9, 12, 13],
'Y': [2, 3, 6, 7, 10, 11, 14, 15]
})
# print(desired_df)
Date Category X Y
0 20190101 A 0 2
1 20190101 A 1 3
2 20190101 B 4 6
3 20190101 B 5 7
4 20190102 A 8 10
5 20190102 A 9 11
6 20190102 B 12 14
7 20190102 B 13 15
I think I have figured out a way to achieve this, but it feels quite ugly and hacky: it involves first creating a new column for the sample number, then creating a new index that combines all of that information together, pivoting, and then separating it all out again:
def add_sample_index(df):
df['sample_index'] = range(len(df))
return df
new_df = df.groupby(['Date', 'Category', 'Sub-Category']).apply(add_sample_index)
new_df['new_index'] = new_df['Date'] + '_' + new_df['Category'] + '_' + new_df['sample_index'].astype(str)
wide_df = new_df.pivot(index='new_index', columns='Sub-Category', values='Samples').reset_index()
wide_df['Date'], wide_df['Category'], wide_df['sample_index'] = wide_df['new_index'].str.split('_').str
wide_df.drop(['new_index', 'sample_index'],
inplace=True, axis='columns')
final_df = wide_df[['Date', 'Category', 'X', 'Y']]
# print((final_df == desired_df).all().all())
True
My question: is there a better, more efficient way of solving this problem?
Upvotes: 1
Views: 901
Reputation: 323226
First we using groupby
cumcount
create the additional key, then it is more like pivot
problem
df['New']=df.groupby(['Date','Category','Sub-Category']).cumcount()
yourdf=df.pivot_table(values='Samples',index=['Date','Category','New'],columns='Sub-Category').reset_index()
yourdf
Out[703]:
Sub-Category Date Category New X Y
0 20190101 A 0 0 2
1 20190101 A 1 1 3
2 20190101 B 0 4 6
3 20190101 B 1 5 7
4 20190102 A 0 8 10
5 20190102 A 1 9 11
6 20190102 B 0 12 14
7 20190102 B 1 13 15
Upvotes: 5