IngridvW

Reputation: 1

How to compare the value to each subsequent value in a row till a threshold based on multiple conditions is met

I reviewed the Stack Overflow question "In Python, how to compare the value to each subsequent value in a row until a condition is met", and I would like to extend its criteria with additional variables.

I have this dataframe in Python:

import pandas as pd
data = {"ID": [117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118], 
        "Date": ["2023-11-14", "2024-01-25", "2024-02-01", "2024-02-04", "2024-02-11", "2024-03-04",
        "2024-01-02", "2024-01-28", "2024-02-04", "2024-02-18", "2024-03-11", "2024-06-05"], 
        "status": ['S', 'S', 'S', 'E', 'E', 'E', 'E', 'E', 'S', 'S', 'S', 'E']}
df = pd.DataFrame(data)

What I would like to do is compare the first date where "status" is 'S' with each following date until the difference between the dates reaches the threshold of 30 days. Once a row reaches the threshold, I'd like to search for the next row where "status" is 'S'; that date becomes the new reference and is compared with the following dates, and so on. Rows within the threshold get the same integer/id/name, regardless of their status.
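To make the rule concrete, here is a minimal sketch of the day gaps for ID 117, using only the standard library. The `>= 30` comparison is an assumption taken from my code further down, and the names `gaps_from_first` and `gaps_from_new_ref` are only for illustration.

```python
from datetime import date

# Dates for ID 117, copied from the dataframe above.
dates_117 = ["2023-11-14", "2024-01-25", "2024-02-01",
             "2024-02-04", "2024-02-11", "2024-03-04"]
parsed = [date.fromisoformat(d) for d in dates_117]

# Gap in days between each date and the first reference date (status 'S').
ref = parsed[0]
gaps_from_first = [(d - ref).days for d in parsed]
print(gaps_from_first)    # [0, 72, 79, 82, 89, 111]

# 2024-01-25 is the first row at or past 30 days and has status 'S',
# so it becomes the new reference date.
new_ref = parsed[1]
gaps_from_new_ref = [(d - new_ref).days for d in parsed[1:]]
print(gaps_from_new_ref)  # [0, 7, 10, 17, 39]
```

The last row (2024-03-04) is 39 days after the new reference, so it reaches the threshold again, but its status is 'E'.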

I would expect an extra column 'flag', computed per ID group.

Expected output

Expected results for patient 117

Expected results for patient 118

Python code so far: it can assign the same integer to dates within 30 days of the reference date and a new integer once a date is 30 or more days past it. But I also have to check the variable "status". I am struggling to check whether status is equal to 'S' for every new reference date.

import pandas as pd
data = {"ID": [117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118], 
        "Date": ["2023-11-14", "2024-01-25", "2024-02-01", "2024-02-04", "2024-02-11", "2024-03-04",
        "2024-01-02", "2024-01-28", "2024-02-04", "2024-02-18", "2024-03-11", "2024-06-05"], 
        "status": ['S', 'S', 'S', 'E', 'E', 'E', 'E', 'E', 'S', 'S', 'S', 'E']}
df = pd.DataFrame(data)

# make custom function
def get_flag(d, thresh=30):
    dates = pd.to_datetime(d['Date'])
    status = d['status']
    ref = dates.iloc[0]
    result = [1]
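    n = 2
    
    for date in dates.iloc[1:]:
        if (date - ref).days >= thresh:
            # threshold reached: start a new flag and move the reference date
            result.append(n)
            ref = date
            n += 1
        else:
            # still within the threshold: repeat the current flag
            result.append(result[-1])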
    return d.assign(flag=result)
        
# groupby + apply + custom function    
out = df.groupby('ID', group_keys=False).apply(get_flag)
out

Upvotes: 0

Views: 28

Answers (1)

mozway

Reputation: 260195

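Since your logic is iterative, a loop is indeed a way to go. You can change your function to encode the status as 0/1 ('S' is 1) and to build on the last result. When a date is at or past the threshold, it becomes the new reference: increment the flag if its status is 'S', otherwise set the flag to 0. Rows within the threshold repeat the previous flag: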

def get_flag(g, thresh=30):
    dates = pd.to_datetime(g['Date'])
    status = g['status'].eq('S').astype(int)
    ref = dates.iloc[0]
    s = status.iloc[0]
    result = [s]

    for i in range(1, len(g)):
        date = dates.iloc[i]
        stat = status.iloc[i]
        if (date - ref).days >= thresh:
            result.append(result[-1]+stat if stat else 0)
            ref = date 
        else:
            result.append(result[-1])
    return g.assign(flag=result)
        
# groupby + apply + custom function    
out = df.groupby('ID', group_keys=False).apply(get_flag)

Output:

     ID        Date status  flag
0   117  2023-11-14      S     1
1   117  2024-01-25      S     2
2   117  2024-02-01      S     2
3   117  2024-02-04      E     2
4   117  2024-02-11      E     2
5   117  2024-03-04      E     0
6   118  2024-01-02      E     0
7   118  2024-01-28      E     0
8   118  2024-02-04      S     1
9   118  2024-02-18      S     1
10  118  2024-03-11      S     2
11  118  2024-06-05      E     0
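
As a rough sketch, the same rule can also be written on plain lists for a single ID, which makes the logic easy to check without pandas. The helper name `flags_for_group` is hypothetical, and it assumes the dates are ISO strings, already sorted, and that the group is not empty.

```python
from datetime import date

def flags_for_group(dates, statuses, thresh=30):
    """Apply the same rule as get_flag to one ID, using plain lists."""
    parsed = [date.fromisoformat(d) for d in dates]
    is_s = [int(s == "S") for s in statuses]  # 'S' -> 1, anything else -> 0
    ref = parsed[0]
    result = [is_s[0]]
    for d, stat in zip(parsed[1:], is_s[1:]):
        if (d - ref).days >= thresh:
            # At or past the threshold: an 'S' row increments the flag,
            # any other status resets it to 0. Either way the reference moves.
            result.append(result[-1] + 1 if stat else 0)
            ref = d
        else:
            # Within the threshold: repeat the previous flag.
            result.append(result[-1])
    return result

flags_118 = flags_for_group(
    ["2024-01-02", "2024-01-28", "2024-02-04",
     "2024-02-18", "2024-03-11", "2024-06-05"],
    ["E", "E", "S", "S", "S", "E"],
)
print(flags_118)  # [0, 0, 1, 1, 2, 0]
```

This matches the `flag` column for ID 118 in the output above.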

Upvotes: 1
