Reputation: 1636
My DataFrame looks like this,
df =
index | A | B | C
0 |00456|text1|date1
1 |00443|text1|date2
2 |00456|text1|date3
3 |00231|text2|date4
4 |00231|text3|date1
5 |00456|text2|date1
I want to end up with the following,
df =
index | A | B_1 |B_2 |B_3 |C_1...
0 |00456|text1|text1|text2|...
1 |00443|text1|NaN |NaN |...
2 |00231|text2|text3|NaN |...
The _1 suffix and the NaN values are not important; what I want is to have all the values on the same row, grouped by the contents of column A.
What is the best way to achieve that?
Note that I have already done this using what I think is an over-complicated method, involving operations like the following:
groups = df.groupby(["A"])
df = pd.concat((groups.A.apply(lambda x: pd.Series(data=x.values)).unstack(),
                groups.B.apply(lambda x: pd.Series(data=x.values)).unstack(),
                groups.C.apply(lambda x: pd.Series(data=x.values)).unstack()),
               keys=['A', 'B', 'C'], axis=1)
Upvotes: 1
Views: 44
Reputation: 863166
Use GroupBy.cumcount to build a per-group counter, reshape with DataFrame.set_index and DataFrame.unstack, then flatten the MultiIndex in the columns and convert the index back to a column:
g = df.groupby('A').cumcount() + 1
df1 = df.set_index(['A', g]).unstack()
df1.columns = [f'{a}_{b}' for a, b in df1.columns]
df1 = df1.reset_index()
print(df1)
     A    B_1    B_2    B_3    C_1    C_2    C_3
0  231  text2  text3    NaN  date4  date1    NaN
1  443  text1    NaN    NaN  date2    NaN    NaN
2  456  text1  text1  text2  date1  date3  date1
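For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same steps. The sample frame is rebuilt from the question, and storing column A as strings is an assumption made here so that the leading zeros are kept; the name `wide` is just illustrative.

```python
import pandas as pd

# Sample data mirroring the question. A is stored as strings
# (an assumption) so the leading zeros are preserved.
df = pd.DataFrame({
    "A": ["00456", "00443", "00456", "00231", "00231", "00456"],
    "B": ["text1", "text1", "text1", "text2", "text3", "text2"],
    "C": ["date1", "date2", "date3", "date4", "date1", "date1"],
})

# Number each row within its A group: 1, 2, 3, ...
g = df.groupby("A").cumcount() + 1

# Move A and the counter into the index, then pivot the counter into columns.
wide = df.set_index(["A", g]).unstack()

# Flatten the (column, counter) MultiIndex into names such as B_1, C_2.
wide.columns = [f"{col}_{n}" for col, n in wide.columns]

# Turn A back into a regular column.
wide = wide.reset_index()

print(wide)
```

The counter is what makes the reshape possible: without it, the repeated values in A would give duplicate index entries and unstack would have nothing to spread across columns. Groups with fewer rows than the largest group are padded with NaN.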
Upvotes: 1