Reputation: 37
I'm trying to build a counter to keep track of the number of failures and successes for several different users. I have a dataframe with user codes that repeat (when there are multiple events for the same user) and a timestamp to track the time variable. I'd like to add two columns (# of successes, # of failures) that accumulate the results of each user's preceding events.
Example data:
import pandas as pd

data = pd.DataFrame(
    {
        'user_id': [2, 2, 3, 2, 4, 5, 3, 3, 6, 6, 6, 7],
        'timestamp': [1567641600, 1567691600, 1567741600, 1567941600, 1567981600, 1567991600,
                      1568391600, 1568541600, 1568741600, 1568941600, 1568981600, 1568988600],
        'status': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes']
    }
)
I tried with some loops in R, but I fear I'm missing something. Maybe there's a better way to do this in Python?
The desired outcome would be something like this:
data = pd.DataFrame(
    {
        'user_id': [2, 2, 3, 2, 4, 5, 3, 3, 6, 6, 6, 7],
        'timestamp': [1567641600, 1567691600, 1567741600, 1567941600, 1567981600, 1567991600,
                      1568391600, 1568541600, 1568741600, 1568941600, 1568981600, 1568988600],
        'status': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes'],
        'number_yes': [1, 1, 1, 1, 1, 1, 2, 2, 0, 1, 1, 1],
        'number_no': [0, 1, 0, 2, 0, 0, 0, 1, 1, 1, 2, 0]
    }
)
Upvotes: 1
Views: 52
Reputation: 153510
Let's use str.get_dummies:
data.join(data['status'].str.get_dummies()
              .groupby(data['user_id']).cumsum()
              .add_prefix('Number_'))
Output:
user_id timestamp status Number_no Number_yes
0 2 1567641600 yes 0 1
1 2 1567691600 no 1 1
2 3 1567741600 yes 0 1
3 2 1567941600 no 2 1
4 4 1567981600 yes 0 1
5 5 1567991600 yes 0 1
6 3 1568391600 yes 0 2
7 3 1568541600 no 1 2
8 6 1568741600 no 1 0
9 6 1568941600 yes 1 1
10 6 1568981600 no 2 1
11 7 1568988600 yes 0 1
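To see how the chain works, it can help to look at the intermediate frame that str.get_dummies produces before the grouped cumulative sum is applied (a minimal sketch, reusing the data from the question):

dummies = data['status'].str.get_dummies()   # one indicator column per distinct status value
print(dummies.head())
#    no  yes
# 0   0    1
# 1   1    0
# 2   0    1
# 3   1    0
# 4   0    1

Grouping these indicator columns by data['user_id'] and taking cumsum turns each column into a per-user running count, which is then joined back onto the original frame.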
What I like about using str.get_dummies is that it will handle more than just 'yes' and 'no'. Let's insert a new status, 'maybe':
data = pd.DataFrame(
    {
        'user_id': [2, 2, 3, 2, 4, 5, 3, 3, 6, 6, 6, 7],
        'timestamp': [1567641600, 1567691600, 1567741600, 1567941600, 1567981600, 1567991600,
                      1568391600, 1568541600, 1568741600, 1568941600, 1568981600, 1568988600],
        'status': ['yes', 'no', 'yes', 'no', 'maybe', 'yes', 'yes', 'no', 'maybe', 'yes', 'no', 'yes']
    }
)
data.join(data['status'].str.get_dummies()
              .groupby(data['user_id']).cumsum()
              .add_prefix('Number_'))
Output:
user_id timestamp status Number_maybe Number_no Number_yes
0 2 1567641600 yes 0 0 1
1 2 1567691600 no 0 1 1
2 3 1567741600 yes 0 0 1
3 2 1567941600 no 0 2 1
4 4 1567981600 maybe 1 0 0
5 5 1567991600 yes 0 0 1
6 3 1568391600 yes 0 0 2
7 3 1568541600 no 0 1 2
8 6 1568741600 maybe 1 0 0
9 6 1568941600 yes 1 0 1
10 6 1568981600 no 1 1 1
11 7 1568988600 yes 0 0 1
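If you want the exact column names from the desired output (number_yes, number_no), a lower-case prefix gets you there with the same approach (a quick sketch, not something the method requires):

data.join(data['status'].str.get_dummies()
              .groupby(data['user_id']).cumsum()
              .add_prefix('number_'))

With the original two-status data this yields exactly the number_yes and number_no columns from the question; with the 'maybe' example it also adds number_maybe.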
Upvotes: 1
Reputation: 71707
Use Series.eq to create a boolean mask, then group the mask by user_id with Series.groupby and take the running total with cumsum:
m = data['status'].eq('yes')
data = data.assign(
    number_yes=m.groupby(data['user_id']).cumsum(),
    number_no=(~m).groupby(data['user_id']).cumsum()
)
# print(data)
user_id timestamp status number_yes number_no
0 2 1567641600 yes 1.0 0.0
1 2 1567691600 no 1.0 1.0
2 3 1567741600 yes 1.0 0.0
3 2 1567941600 no 1.0 2.0
4 4 1567981600 yes 1.0 0.0
5 5 1567991600 yes 1.0 0.0
6 3 1568391600 yes 2.0 0.0
7 3 1568541600 no 2.0 1.0
8 6 1568741600 no 0.0 1.0
9 6 1568941600 yes 1.0 1.0
10 6 1568981600 no 1.0 2.0
11 7 1568988600 yes 1.0 0.0
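Because booleans are being summed, the running totals show up as floats above. If integer counts are preferred (as in the desired output), casting is enough; a minimal sketch, assuming m and data as defined above:

data = data.assign(
    number_yes=m.groupby(data['user_id']).cumsum().astype(int),   # cast float running totals to int
    number_no=(~m).groupby(data['user_id']).cumsum().astype(int)
)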
Upvotes: 2
Reputation: 30669
data['number_yes'] = data.groupby('user_id').status.transform(lambda x: (x == 'yes').cumsum())
data['number_no'] = data.groupby('user_id').status.transform(lambda x: (x == 'no').cumsum())
Result:
user_id timestamp status number_yes number_no
0 2 1567641600 yes 1 0
1 2 1567691600 no 1 1
2 3 1567741600 yes 1 0
3 2 1567941600 no 1 2
4 4 1567981600 yes 1 0
5 5 1567991600 yes 1 0
6 3 1568391600 yes 2 0
7 3 1568541600 no 2 1
8 6 1568741600 no 0 1
9 6 1568941600 yes 1 1
10 6 1568981600 no 1 2
11 7 1568988600 yes 1 0
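Note that the cumulative counts follow the row order of the frame. If the rows are not guaranteed to be in chronological order, sorting by timestamp first keeps the running totals consistent with event time (a small sketch, assuming the column names from the question):

# sort so the cumulative counts follow event time, then count as above
data = data.sort_values('timestamp').reset_index(drop=True)
data['number_yes'] = data.groupby('user_id').status.transform(lambda x: (x == 'yes').cumsum())
data['number_no'] = data.groupby('user_id').status.transform(lambda x: (x == 'no').cumsum())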
Upvotes: 2