Reputation: 83
I have a dataframe in the form:
date user type
01/01/2021 Bob A
02/01/2021 Lina test
02/01/2021 Bob test
06/01/2020 Bob C
08/01/2020 Lina A
08/01/2020 Lina A
08/01/2020 Bob test
09/01/2020 Lina test
I want to mutate the said dataframe so that it contains only test type rows moving all the other type values as a new columns with the count before the current test row so that I would getting something like:
date user type A C
02/01/2021 Lina test 0 0
02/01/2021 Bob test 1 0
08/01/2020 Bob test 1 1
09/01/2020 Lina test 2 0
I'm trying to use pd.crosstab
:
pd.crosstab([df['user'],df['date']], df['type'])
But I get this :
User date A C test
Bob 01/01/2021 1 0 0
02/01/2021 0 0 1
06/01/2020 0 1 0
08/01/2020 0 0 1
Lina 02/01/2021 0 0 1
08/01/2020 1 0 0
08/01/2020 1 0 0
09/01/2020 0 0 1
How can I sum A and B value before each 1 in the test column?
Upvotes: 0
Views: 55
Reputation: 78
To achieve something along the lines of what you are looking for, you need to group by user
on your crosstab and then apply the cumsum
aggregation function.
ct = pd.crosstab([df.user, df.date], df.type)
ctgb = ct.groupby("user").cumsum()
User date A C test
Bob 01/01/2021 1 0 0
02/01/2021 1 0 1
06/01/2020 1 1 0
08/01/2020 1 1 1
Lina 02/01/2021 0 0 1
08/01/2020 2 0 0
09/01/2020 2 0 2
Then to find out the sum of type A and C up until a test day, apply a mask filtering out days in which test == 1
in the crosstab
frame.
cumulative_test_scores = ctgb.loc[ct.test == 1, ["A", "C"]]
User date A C
Bob 02/01/2021 1 0
08/01/2020 1 1
Lina 02/01/2021 0 0
09/01/2020 2 0
It's not the exact same format you desired however I believe it makes more sense than having the redundant type
column and illustrates the results are for cumulative scores are for each user.
To get your desire format:
cumulative_test_scores.reset_index(level=0, inplace=True)
cumulative_test_scores.columns.name = None
cumulative_test_scores.insert(1, 'type', 'test')
Upvotes: 1