Victorbug
Victorbug

Reputation: 137

Create column "status", based on the presence of different elements in another column, for each pair ("user_id","date")

I just want to make the column "status" shown in the table below. This column assigns the label "Complete" or "Incomplete" if the column "question" has the 4 elements ("a","b","c","d") for each pair "user_id" and "date"

df=(pd.DataFrame({'user_id':[1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4],
                  'date':["01-01-2020","01-01-2020","01-01-2020","01-01-2020","02-01-2020","02-01-2020","03-01-2020","04-01-2020",
                          "04-01-2020","04-01-2020","05-01-2020","05-01-2020","06-01-2020","06-01-2020","07-01-2020","08-01-2020",
                          "08-01-2020","09-01-2020","09-01-2020","09-01-2020","09-01-2020","10-01-2020","10-01-2020","11-01-2020",
                          "11-01-2020","12-01-2020","12-01-2020"],
    'question':["a","b","c","d","a","b","a","a","b","c","a","b","a","b","a","a","b","a","b","c","d","a","b","a","b","a","b"],
    'status': ["Complete","Complete","Complete","Complete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete",
               "Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Complete","Complete","Complete",
               "Complete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete"]}))
display_full(df)

|    |   user_id | date       | question   | status     |
|---:|----------:|:-----------|:-----------|:-----------|
|  0 |         1 | 01-01-2020 | a          | Complete   |
|  1 |         1 | 01-01-2020 | b          | Complete   |
|  2 |         1 | 01-01-2020 | c          | Complete   |
|  3 |         1 | 01-01-2020 | d          | Complete   |
|  4 |         1 | 02-01-2020 | a          | Incomplete |
|  5 |         1 | 02-01-2020 | b          | Incomplete |
|  6 |         1 | 03-01-2020 | a          | Incomplete |
|  7 |         2 | 04-01-2020 | a          | Incomplete |
|  8 |         2 | 04-01-2020 | b          | Incomplete |
|  9 |         2 | 04-01-2020 | c          | Incomplete |
| 10 |         2 | 05-01-2020 | a          | Incomplete |
| 11 |         2 | 05-01-2020 | b          | Incomplete |
| 12 |         2 | 06-01-2020 | a          | Incomplete |
| 13 |         2 | 06-01-2020 | b          | Incomplete |
| 14 |         3 | 07-01-2020 | a          | Incomplete |
| 15 |         3 | 08-01-2020 | a          | Incomplete |
| 16 |         3 | 08-01-2020 | b          | Incomplete |
| 17 |         3 | 09-01-2020 | a          | Complete   |
| 18 |         3 | 09-01-2020 | b          | Complete   |
| 19 |         3 | 09-01-2020 | c          | Complete   |
| 20 |         3 | 09-01-2020 | d          | Complete   |
| 21 |         4 | 10-01-2020 | a          | Incomplete |
| 22 |         4 | 10-01-2020 | b          | Incomplete |
| 23 |         4 | 11-01-2020 | a          | Incomplete |
| 24 |         4 | 11-01-2020 | b          | Incomplete |
| 25 |         4 | 12-01-2020 | a          | Incomplete |
| 26 |         4 | 12-01-2020 | b          | Incomplete |

Upvotes: 1

Views: 58

Answers (1)

sophocles
sophocles

Reputation: 13831

You could use GroupBy.transform with set to remove duplicates, and then the len function to count the resulting unique elements, which will allow us to see if the column "question" has the 4 elements ("a","b","c","d") for each pair "user_id" and "date".

import numpy as np
import pandas as pd

df['temp'] = df.groupby(['user_id','date'])['question'].transform(lambda x: len(set(x)))
df['status_new'] = np.where(df['temp'] <4,'Incomplete','Complete')
df.drop('temp',axis=1,inplace=True)

Will print:

    user_id        date question      status  status_new
0         1  01-01-2020        a    Complete    Complete
1         1  01-01-2020        b    Complete    Complete
2         1  01-01-2020        c    Complete    Complete
3         1  01-01-2020        d    Complete    Complete
4         1  02-01-2020        a  Incomplete  Incomplete
5         1  02-01-2020        b  Incomplete  Incomplete
6         1  03-01-2020        a  Incomplete  Incomplete
7         2  04-01-2020        a  Incomplete  Incomplete
8         2  04-01-2020        b  Incomplete  Incomplete
9         2  04-01-2020        c  Incomplete  Incomplete
10        2  05-01-2020        a  Incomplete  Incomplete
11        2  05-01-2020        b  Incomplete  Incomplete
12        2  06-01-2020        a  Incomplete  Incomplete
13        2  06-01-2020        b  Incomplete  Incomplete
14        3  07-01-2020        a  Incomplete  Incomplete
15        3  08-01-2020        a  Incomplete  Incomplete
16        3  08-01-2020        b  Incomplete  Incomplete
17        3  09-01-2020        a    Complete    Complete
18        3  09-01-2020        b    Complete    Complete
19        3  09-01-2020        c    Complete    Complete
20        3  09-01-2020        d    Complete    Complete
21        4  10-01-2020        a  Incomplete  Incomplete
22        4  10-01-2020        b  Incomplete  Incomplete
23        4  11-01-2020        a  Incomplete  Incomplete
24        4  11-01-2020        b  Incomplete  Incomplete
25        4  12-01-2020        a  Incomplete  Incomplete
26        4  12-01-2020        b  Incomplete  Incomplete

Try not dropping the temp column, to clearly see what it does. It will show how many questions were answered, per user_id per date.

Upvotes: 2

Related Questions