Let's say I have data in the following format:
group_id | entity_id | value
A        | a1        | 5
A        | a2        | 3
A        | a3        | 2
B        | b1        | 10
B        | b2        | 8
B        | b3        | 11
C        | c1        | 2
C        | c2        | 6
C        | c3        | NaN
Table 1.
Each group (A/B/C) is guaranteed to have exactly 3 entities, and each entity has a corresponding value (NaN if non-existent).
I want to reshape this data from the existing long format to the following wide format:
group_id | entity_1 | entity_2 | entity_3
A        | 5        | 3        | 2
B        | 10       | 8        | 11
C        | 2        | 6        | NaN
Table 2.
Where entity_1/entity_2/entity_3 correspond to a1/a2/a3 (or b1/b2/b3, c1/c2/c3) respectively.
How do I do this?
One solution I found was to use the pivot function:
df.pivot(index='group_id', columns='entity_id', values='value')
But as I understand it, the problem is that the columns in the resulting pivot table will be the raw entity IDs (a1, a2, ..., c3) rather than the entity_1/entity_2/entity_3 layout in Table 2 -- and that layout is important for some downstream processing I'm doing with the data.
I might be asking a stupid question but I had trouble finding ways to use the existing pivot/melt functions to go from long to wide in the way that I described above. Can anyone help me out?
I'm happy to provide more details if necessary, just let me know!
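For reference, here is a runnable sketch (the DataFrame construction is mine, matching Table 1) that shows what the plain pivot produces:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group_id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'entity_id': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3'],
    'value': [5, 3, 2, 10, 8, 11, 2, 6, np.nan],
})

# The plain pivot makes one column per distinct entity_id,
# so we get 9 columns (a1..c3) instead of entity_1..entity_3.
out = df.pivot(index='group_id', columns='entity_id', values='value')
print(out.columns.tolist())
# ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']
```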
You can use pivot, where the new column names come from the last character of entity_id, extracted by string indexing with str:
df = (df.assign(entity=df['entity_id'].str[-1])
        .pivot(index='group_id', columns='entity', values='value')
        .add_prefix('entity_')
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
group_id entity_1 entity_2 entity_3
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
Solution with cumcount, which numbers the rows within each group:
df = (df.assign(entity=df.groupby('group_id').cumcount() + 1)
        .pivot(index='group_id', columns='entity', values='value')
        .add_prefix('entity_')
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
group_id entity_1 entity_2 entity_3
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
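To see what cumcount contributes, here is a minimal sketch (the sample frame is mine, matching Table 1):

```python
import pandas as pd

df = pd.DataFrame({
    'group_id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [5, 3, 2, 10, 8, 11, 2, 6, None],
})

# cumcount numbers the rows within each group starting at 0;
# adding 1 shifts it to the 1-based entity positions.
counter = df.groupby('group_id').cumcount() + 1
print(counter.tolist())
# [1, 2, 3, 1, 2, 3, 1, 2, 3]
```

Because it relies only on row position within each group, this works even when entity_id has no numeric suffix to extract.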
Another solution with groupby and apply, with a final reshape by unstack:
df = (df.groupby('group_id')['value']
        .apply(lambda x: pd.Series(x.values))
        .unstack()
        .add_prefix('entity_')
        .reset_index())
print(df)
group_id entity_0 entity_1 entity_2
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
If you need the count to start from 1:
df = (df.groupby('group_id')['value']
        .apply(lambda x: pd.Series(x.values))
        .unstack()
        .rename(columns=lambda x: x + 1)
        .add_prefix('entity_')
        .reset_index())
print(df)
group_id entity_1 entity_2 entity_3
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
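Since each group is guaranteed to have exactly 3 entities, a plain NumPy reshape is another option, assuming the rows are already ordered by group (column names are my choice, matching Table 2):

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group_id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'entity_id': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3'],
    'value': [5, 3, 2, 10, 8, 11, 2, 6, np.nan],
})

# Fold the 9 values into a 3x3 block, one row per group.
wide = pd.DataFrame(df['value'].to_numpy().reshape(-1, 3),
                    columns=['entity_1', 'entity_2', 'entity_3'])
# unique() preserves order of first appearance, so groups stay aligned.
wide.insert(0, 'group_id', df['group_id'].unique())
print(wide)
```

This sidesteps pivot entirely, but it silently misaligns if the input is not sorted by group_id, so the pivot-based solutions above are safer in general.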