Reputation: 142
I have a dataset in this general format:
group_id item_id zone time
1 1 1 1
1 3 4 1
2 6 3 1
2 2 4 1
3 1 1 1
1 2 3 2
1 6 2 2
2 8 1 2
2 5 2 2
3 1 1 2
I am looking to create a new column for each potential value in the zone column, then set it to 1 in the rows where zone has that value and 0 otherwise. Say there were four potential zones, listed in names = ['zone_1', 'zone_2', 'zone_3', 'zone_4']. The output should be something like this:
group_id item_id time zone_1 zone_2 zone_3 zone_4
1 1 1 1 0 0 0
1 3 1 0 0 0 1
2 6 1 0 0 1 0
2 2 1 0 0 0 1
3 1 1 1 0 0 0
1 2 2 0 0 1 0
1 6 2 0 1 0 0
2 8 2 1 0 0 0
2 5 2 0 1 0 0
3 1 2 1 0 0 0
I'm not sure how best to achieve this. I can create the new columns directly from the name list. I could also replace each zone value in place with an array (e.g. zone value 3 becomes [0,0,1,0]), then call df.explode on 'zone' followed by a pivot, but replacing line by line is expensive and not very pandas-like. I'd appreciate any other suggestions.
It may be important to note that group_id + item_id + time together identify a unique item (no two rows share the same values in all three columns).
Thank you!
EDIT:
To clarify, as I was unclear: there may be more potential zones than actually appear in the column. For example, the above dataset could be paired with names = ['zone_1', 'zone_2', 'zone_3', 'zone_4', 'zone_5']. Even though zone 5 never appears in the dataframe, it still needs its own column (for later visualization purposes), so ideally a solution will create it as well. The output in this case would be:
group_id item_id time zone_1 zone_2 zone_3 zone_4 zone_5
1 1 1 1 0 0 0 0
1 3 1 0 0 0 1 0
2 6 1 0 0 1 0 0
2 2 1 0 0 0 1 0
3 1 1 1 0 0 0 0
1 2 2 0 0 1 0 0
1 6 2 0 1 0 0 0
2 8 2 1 0 0 0 0
2 5 2 0 1 0 0 0
3 1 2 1 0 0 0 0
Upvotes: 0
Reputation: 29742
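IIUC, you can use pandas.get_dummies:
import pandas as pd
zones = df.pop("zone")
# axis is passed by keyword (required in recent pandas); dtype=int keeps 0/1 instead of True/False
new_df = pd.concat([df, pd.get_dummies(zones, prefix="zone", dtype=int)], axis=1)
print(new_df)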
Output:
group_id item_id time zone_1 zone_2 zone_3 zone_4
0 1 1 1 1 0 0 0
1 1 3 1 0 0 0 1
2 2 6 1 0 0 1 0
3 2 2 1 0 0 0 1
4 3 1 1 1 0 0 0
5 1 2 2 0 0 1 0
6 1 6 2 0 1 0 0
7 2 8 2 1 0 0 0
8 2 5 2 0 1 0 0
9 3 1 2 1 0 0 0
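To also cover the EDIT in the question (zones listed in names that never occur in the data), one option is to convert zone to a categorical with the full set of categories before calling get_dummies. This is a minimal sketch, assuming every entry in names follows the zone_<n> pattern with an integer n; the DataFrame is rebuilt here only so the example is self-contained.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 3, 1, 1, 2, 2, 3],
    "item_id":  [1, 3, 6, 2, 1, 2, 6, 8, 5, 1],
    "zone":     [1, 4, 3, 4, 1, 3, 2, 1, 2, 1],
    "time":     [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})
names = ["zone_1", "zone_2", "zone_3", "zone_4", "zone_5"]

# Assumption: each name is "zone_<integer>", so the integer can be parsed back out.
categories = [int(name.split("_")[1]) for name in names]

# A categorical remembers categories that never occur, so get_dummies
# emits a column for zone 5 even though no row uses it.
zones = df.pop("zone").astype(pd.CategoricalDtype(categories=categories))
new_df = pd.concat([df, pd.get_dummies(zones, prefix="zone", dtype=int)], axis=1)
print(new_df)
```

The zone_5 column comes out as all zeros. Values in zone that are not in the category list would become NaN and produce an all-zero row, so this relies on names covering every zone present in the data.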
Upvotes: 1
Reputation: 323306
Alternatively, use pd.crosstab (this keeps the original zone column):
df = df.join(pd.crosstab(df.index, df.zone).add_prefix('zone_'))
df
group_id item_id zone time zone_1 zone_2 zone_3 zone_4
0 1 1 1 1 1 0 0 0
1 1 3 4 1 0 0 0 1
2 2 6 3 1 0 0 1 0
3 2 2 4 1 0 0 0 1
4 3 1 1 1 1 0 0 0
5 1 2 3 2 0 0 1 0
6 1 6 2 2 0 1 0 0
7 2 8 1 2 1 0 0 0
8 2 5 2 2 0 1 0 0
9 3 1 1 2 1 0 0 0
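The crosstab approach only produces columns for zones that occur in the data. To handle the extra zones from the EDIT, the result could be reindexed against the full names list before joining. This is a rough sketch, assuming names is the list from the question; the DataFrame is rebuilt only to keep the example self-contained.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group_id": [1, 1, 2, 2, 3, 1, 1, 2, 2, 3],
    "item_id":  [1, 3, 6, 2, 1, 2, 6, 8, 5, 1],
    "zone":     [1, 4, 3, 4, 1, 3, 2, 1, 2, 1],
    "time":     [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})
names = ["zone_1", "zone_2", "zone_3", "zone_4", "zone_5"]

# One row per original index, one column per observed zone.
tab = pd.crosstab(df.index, df["zone"]).add_prefix("zone_")

# reindex adds any column listed in names but missing from tab, filled with 0.
tab = tab.reindex(columns=names, fill_value=0)

out = df.join(tab)
print(out)
```

Because reindex keeps only the columns listed in names, a zone present in the data but absent from names would be dropped silently, so it may be worth checking for that first.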
Upvotes: 1