Reputation: 971
I have a df where there are two columns uid and p, I want to add a new column to the existing df or create an entirely new df with the additional column whose values are based on the values of column "uid" and list of indices x:
x = [2, 9, 12]
x contains the indices where I am supposed to generate new ids, the new ids are increment of the previous ones. So, there will be two cases, whenever the index from the list x is encountered, new id is generated and whenever the ids in uid column changes then new id is again generated as shown below:
uid expected_newid p
0 1 1 10
1 1 1 23
2 1 2 20 #new id generated at index 2
3 1 2 40
4 2 3 21 #newid generated as "uid" changes
5 2 3 89
6 2 3 45
7 3 4 50
8 3 4 32
9 3 5 76 #new id generated at index 9
10 3 5 71
11 3 5 90
12 3 6 56 #new id generated at index 12
13 3 6 87
Please let me know if anything is unclear.
I could manage the case to generate new id whenever uid is changing with the following code
df['newid'] = (df.uid.diff() != 0).cumsum()
But it should also generate the newid at the indices mentioned in the list x as shown in the column "expected_newid"
Upvotes: 5
Views: 4143
Reputation: 353499
IIUC, you can simply extend the condition you're currently using to include the possibility that the index is in x
using an "or" (here written as |
):
In [12]: df["newid"] = ((df.uid.diff() != 0) | (df.index.isin(x))).cumsum()
In [13]: df
Out[13]:
uid expected_newid p newid
0 1 1 10 1
1 1 1 23 1
2 1 2 20 2
3 1 2 40 2
4 2 3 21 3
5 2 3 89 3
6 2 3 45 3
7 3 4 50 4
8 3 4 32 4
9 3 5 76 5
10 3 5 71 5
11 3 5 90 5
12 3 6 56 6
13 3 6 87 6
Upvotes: 2
Reputation: 9963
How about this:
df = pd.DataFrame({'uid': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]})
# Add column uidd that is a 1 (True) when the existing id changes
df['uidd'] = (df.uid.diff() != 0)
# Add a column that is 0 except at the indices in your list
# where it is 1
df['byidx'] = 0
df.loc[[2, 9, 12], 'byidx'] = 1
# now combine them so we get a 1 where either has changed
df['both'] = df.uidd + byidx
# And finally, cumsum will generate the correct ids
df['newuid'] = df.both.cumsum()
The result is:
uid uidd byidx both newuid
0 1 True 0 1 1
1 1 False 0 0 1
2 1 False 1 1 2
3 1 False 0 0 2
4 2 True 0 1 3
5 2 False 0 0 3
6 2 False 0 0 3
7 3 True 0 1 4
8 3 False 0 0 4
9 3 False 1 1 5
10 3 False 0 0 5
11 3 False 0 0 5
12 3 False 1 1 6
13 3 False 0 0 6
Naturally you don't need to add the columns uidd
, byidx
, or both
to the data frame and you could combine all those steps. I just thought separating them out like that might make things more clear. You could also add them as above and then simply delete the temporary columns.
Upvotes: 3