Reputation: 923
I have a data-frame like below and I want to create a new column based a column
Condition
If two or multiple rows contain the same value for Column A and Column B and Column C is unique then row should be assigned unique group name
Input
A B C
Payment Mark 2
Payment Mark 3
Delayed Charles 2
Delayed Charles 4
Held Mark 3
Held Charles 4
Payment Mark 2
Payment Mark 3
Delayed Charles 4
Held Charles 4
Expected Output
A B C Cluster
Payment Mark 2 C1
Payment Mark 3 C2
Delayed Charles 2 C3
Delayed Charles 4 C4
Held Mark 3 C5
Held Charles 4 C6
Payment Mark 2 C1
Payment Mark 3 C2
Delayed Charles 4 C4
Held Charles 4 C6
How can this be done in python
Upvotes: 3
Views: 1009
Reputation: 75080
This is a factorize
problem .
One way would be join the columns into a single series and factorize
, add
1 and add
C
in the beginning:
df['Cluster'] = (pd.Series(df[['A','B','C']].astype(str).agg(''.join,1)
.factorize()[0]).add(1).astype(str).radd('C'))
print(df)
Another way would be similar but with groupby.ngroup
df['Cluster'] = (df.groupby(['A','B','C'],sort=False).ngroup()
.add(1).astype(str).radd('C'))
A B C Cluster
0 Payment Mark 2 C1
1 Payment Mark 3 C2
2 Delayed Charles 2 C3
3 Delayed Charles 4 C4
4 Held Mark 3 C5
5 Held Charles 4 C6
6 Payment Mark 2 C1
7 Payment Mark 3 C2
8 Delayed Charles 4 C4
9 Held Charles 4 C6
Upvotes: 4