ALEX
ALEX

Reputation: 37

Iterative counter based on several variables

I'm trying to build a counter to keep track of the number of failures and successes regarding several different users. I have a dataframe with user codes that repeat themselves (if there are more events about the same users) and a timestamp to track the time variable. I'd like to add two columns (# of successes, # of failures) that cumulate the preceding events' result.

Example data:

data=pd.DataFrame(
    {
        'user_id': [2,2,3,2,4,5,3,3,6,6,6,7],
        'timestamp': [1567641600,1567691600,1567741600,1567941600, 1567981600, 1567991600,1568391600,1568541600,1568741600,1568941600,1568981600,1568988600],
        'status': ['yes','no','yes','no', 'yes', 'yes','yes','no','no','yes','no','yes']
    }
)

I tried with some loops in R but I fear I'm missing something, maybe there's a better way to do this in Python?

The desired outcome would be something like this:

data=pd.DataFrame(
    {
        'user_id': [2,2,3,2,4,5,3,3,6,6,6,7],
        'timestamp': [1567641600,1567691600,1567741600,1567941600, 1567981600, 1567991600,1568391600,1568541600,1568741600,1568941600,1568981600,1568988600],
        'status': ['yes','no','yes','no', 'yes', 'yes','yes','no','no','yes','no','yes'],
        'number_yes':[1,1,1,1,1,1,2,2,0,1,1,1],
        'number_no':[0,1,0,2,0,0,0,1,1,1,2,0]
    }
)

Upvotes: 1

Views: 52

Answers (3)

Scott Boston
Scott Boston

Reputation: 153510

Let's use get_dummies:

data.join(data['status'].str.get_dummies()
                        .groupby(data['user_id']).cumsum()
                        .add_prefix('Number_'))

Output:

    user_id   timestamp status  Number_no  Number_yes
0         2  1567641600    yes          0           1
1         2  1567691600     no          1           1
2         3  1567741600    yes          0           1
3         2  1567941600     no          2           1
4         4  1567981600    yes          0           1
5         5  1567991600    yes          0           1
6         3  1568391600    yes          0           2
7         3  1568541600     no          1           2
8         6  1568741600     no          1           0
9         6  1568941600    yes          1           1
10        6  1568981600     no          2           1
11        7  1568988600    yes          0           1

What I like about using str.get_dummies is that this will handle more than just 'yes' and 'no', let us insert a new status 'maybe':

data=pd.DataFrame(
    {
        'user_id': [2,2,3,2,4,5,3,3,6,6,6,7],
        'timestamp': [1567641600,1567691600,1567741600,1567941600, 1567981600, 1567991600,1568391600,1568541600,1568741600,1568941600,1568981600,1568988600],
        'status': ['yes','no','yes','no', 'maybe', 'yes','yes','no','maybe','yes','no','yes']
    })

data.join(data['status'].str.get_dummies()
                        .groupby(data['user_id']).cumsum()
                        .add_prefix('Number_'))

Output:

    user_id   timestamp status  Number_maybe  Number_no  Number_yes
0         2  1567641600    yes             0          0           1
1         2  1567691600     no             0          1           1
2         3  1567741600    yes             0          0           1
3         2  1567941600     no             0          2           1
4         4  1567981600  maybe             1          0           0
5         5  1567991600    yes             0          0           1
6         3  1568391600    yes             0          0           2
7         3  1568541600     no             0          1           2
8         6  1568741600  maybe             1          0           0
9         6  1568941600    yes             1          0           1
10        6  1568981600     no             1          1           1
11        7  1568988600    yes             0          0           1

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71707

Use, Series.eq to create a boolean mask, then use Series.groupby, on this mask and transform the grouped series using .cumsum:

m = data['status'].eq('yes')
data = data.assign(
    number_yes=m.groupby(data['user_id']).cumsum(),
    number_no=(~m).groupby(data['user_id']).cumsum()
)

# print(data)
    user_id   timestamp status  number_yes  number_no
0         2  1567641600    yes         1.0        0.0
1         2  1567691600     no         1.0        1.0
2         3  1567741600    yes         1.0        0.0
3         2  1567941600     no         1.0        2.0
4         4  1567981600    yes         1.0        0.0
5         5  1567991600    yes         1.0        0.0
6         3  1568391600    yes         2.0        0.0
7         3  1568541600     no         2.0        1.0
8         6  1568741600     no         0.0        1.0
9         6  1568941600    yes         1.0        1.0
10        6  1568981600     no         1.0        2.0
11        7  1568988600    yes         1.0        0.0

Upvotes: 2

Stef
Stef

Reputation: 30669

data['number_yes'] = data.groupby('user_id').status.transform(lambda x: (x == 'yes').cumsum())
data['number_no'] = data.groupby('user_id').status.transform(lambda x: (x == 'no').cumsum())

Result:

    user_id   timestamp status  number_yes  number_no
0         2  1567641600    yes           1          0
1         2  1567691600     no           1          1
2         3  1567741600    yes           1          0
3         2  1567941600     no           1          2
4         4  1567981600    yes           1          0
5         5  1567991600    yes           1          0
6         3  1568391600    yes           2          0
7         3  1568541600     no           2          1
8         6  1568741600     no           0          1
9         6  1568941600    yes           1          1
10        6  1568981600     no           1          2
11        7  1568988600    yes           1          0

Upvotes: 2

Related Questions