Reputation: 137
I just want to make the column "status" shown in the table below. This column assigns the label "Complete" or "Incomplete" if the column "question" has the 4 elements ("a","b","c","d") for each pair "user_id" and "date"
df=(pd.DataFrame({'user_id':[1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4],
'date':["01-01-2020","01-01-2020","01-01-2020","01-01-2020","02-01-2020","02-01-2020","03-01-2020","04-01-2020",
"04-01-2020","04-01-2020","05-01-2020","05-01-2020","06-01-2020","06-01-2020","07-01-2020","08-01-2020",
"08-01-2020","09-01-2020","09-01-2020","09-01-2020","09-01-2020","10-01-2020","10-01-2020","11-01-2020",
"11-01-2020","12-01-2020","12-01-2020"],
'question':["a","b","c","d","a","b","a","a","b","c","a","b","a","b","a","a","b","a","b","c","d","a","b","a","b","a","b"],
'status': ["Complete","Complete","Complete","Complete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete",
"Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Complete","Complete","Complete",
"Complete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete"]}))
display_full(df)
| | user_id | date | question | status |
|---:|----------:|:-----------|:-----------|:-----------|
| 0 | 1 | 01-01-2020 | a | Complete |
| 1 | 1 | 01-01-2020 | b | Complete |
| 2 | 1 | 01-01-2020 | c | Complete |
| 3 | 1 | 01-01-2020 | d | Complete |
| 4 | 1 | 02-01-2020 | a | Incomplete |
| 5 | 1 | 02-01-2020 | b | Incomplete |
| 6 | 1 | 03-01-2020 | a | Incomplete |
| 7 | 2 | 04-01-2020 | a | Incomplete |
| 8 | 2 | 04-01-2020 | b | Incomplete |
| 9 | 2 | 04-01-2020 | c | Incomplete |
| 10 | 2 | 05-01-2020 | a | Incomplete |
| 11 | 2 | 05-01-2020 | b | Incomplete |
| 12 | 2 | 06-01-2020 | a | Incomplete |
| 13 | 2 | 06-01-2020 | b | Incomplete |
| 14 | 3 | 07-01-2020 | a | Incomplete |
| 15 | 3 | 08-01-2020 | a | Incomplete |
| 16 | 3 | 08-01-2020 | b | Incomplete |
| 17 | 3 | 09-01-2020 | a | Complete |
| 18 | 3 | 09-01-2020 | b | Complete |
| 19 | 3 | 09-01-2020 | c | Complete |
| 20 | 3 | 09-01-2020 | d | Complete |
| 21 | 4 | 10-01-2020 | a | Incomplete |
| 22 | 4 | 10-01-2020 | b | Incomplete |
| 23 | 4 | 11-01-2020 | a | Incomplete |
| 24 | 4 | 11-01-2020 | b | Incomplete |
| 25 | 4 | 12-01-2020 | a | Incomplete |
| 26 | 4 | 12-01-2020 | b | Incomplete |
Upvotes: 1
Views: 58
Reputation: 13831
You could use GroupBy.transform
with set
to remove duplicates, and then the len
function to count the resulting unique elements, which will allow us to see if the column "question" has the 4 elements ("a","b","c","d") for each pair "user_id" and "date".
import numpy as np
import pandas as pd
df['temp'] = df.groupby(['user_id','date'])['question'].transform(lambda x: len(set(x)))
df['status_new'] = np.where(df['temp'] <4,'Incomplete','Complete')
df.drop('temp',axis=1,inplace=True)
Will print:
user_id date question status status_new
0 1 01-01-2020 a Complete Complete
1 1 01-01-2020 b Complete Complete
2 1 01-01-2020 c Complete Complete
3 1 01-01-2020 d Complete Complete
4 1 02-01-2020 a Incomplete Incomplete
5 1 02-01-2020 b Incomplete Incomplete
6 1 03-01-2020 a Incomplete Incomplete
7 2 04-01-2020 a Incomplete Incomplete
8 2 04-01-2020 b Incomplete Incomplete
9 2 04-01-2020 c Incomplete Incomplete
10 2 05-01-2020 a Incomplete Incomplete
11 2 05-01-2020 b Incomplete Incomplete
12 2 06-01-2020 a Incomplete Incomplete
13 2 06-01-2020 b Incomplete Incomplete
14 3 07-01-2020 a Incomplete Incomplete
15 3 08-01-2020 a Incomplete Incomplete
16 3 08-01-2020 b Incomplete Incomplete
17 3 09-01-2020 a Complete Complete
18 3 09-01-2020 b Complete Complete
19 3 09-01-2020 c Complete Complete
20 3 09-01-2020 d Complete Complete
21 4 10-01-2020 a Incomplete Incomplete
22 4 10-01-2020 b Incomplete Incomplete
23 4 11-01-2020 a Incomplete Incomplete
24 4 11-01-2020 b Incomplete Incomplete
25 4 12-01-2020 a Incomplete Incomplete
26 4 12-01-2020 b Incomplete Incomplete
Try not dropping the temp column, to clearly see what it does. It will show how many questions were answered, per user_id per date.
Upvotes: 2