Reputation: 3146
I have a very large pd.Dataframe contains millions of records where PID
and Ses_ID
are both index columns, and Var_3
indicates the occurrence of some event.
PID | Ses_ID | Var_1 | Var_2 | Var_3 |
---|---|---|---|---|
001 | 001 002 003 |
0.7 0.8 0.9 |
0.5 0.4 0.3 |
0 1 0 |
002 | 004 005 006 007 008 |
0.8 0.7 0.8 0.2 0.8 |
0.2 0.1 0.7 0.2 0.2 |
0 0 1 0 1 |
I want to remove/filter out sessions following and including the first occurrence of Var_3==1
from each person's (indexed by PID
) records. Thus the provided example would result as:
PID | Ses_ID | Var_1 | Var_2 | Var_3 |
---|---|---|---|---|
001 | 001 | 0.7 | 0.5 | 0 |
002 | 004 005 |
0.8 0.7 |
0.2 0.1 |
0 0 |
I could iteratively add relevant sessions and corresponding PID
to a new dataframe but that would be extremely time-consuming given the size of the current dataframe. What would be an efficient way of achieving this? Many thanks!
Updated situation: I have found many rows have the same Ses_ID
. How do I remove sessions following (and including) the first occurrence of a particular column value? So for the example below, both rows for Ses_ID==005
would be removed because the event of Var_3==1
occurred in this session.
PID | Ses_ID | Var_1 | Var_2 | Var_3 |
---|---|---|---|---|
001 | 001 002 003 |
0.7 0.8 0.9 |
0.5 0.4 0.3 |
0 1 0 |
002 | 009 004 004 005 005 006 007 |
0.1 0.8 0.8 0.7 0.8 0.2 0.8 |
0.3 0.1 0.2 0.1 0.7 0.2 0.2 |
0 0 0 0 1 0 1 |
should be transformed to:
PID | Ses_ID | Var_1 | Var_2 | Var_3 |
---|---|---|---|---|
001 | 001 | 0.7 | 0.5 | 0 |
002 | 009 004 004 |
0.1 0.8 0.8 |
0.3 0.1 0.2 |
0 0 0 |
Upvotes: 2
Views: 275
Reputation: 195428
You can try to use boolean indexing:
# assuming PID, Ses_ID are indices:
mask = df.groupby(level=0)["Var_3"].cumsum().eq(0)
print(df[mask])
Prints:
Var_1 Var_2 Var_3
PID Ses_ID
1 1 0.7 0.5 0
2 4 0.8 0.2 0
5 0.7 0.1 0
EDIT:
g = df.groupby(level=0)
df["Var_3"] = g["Var_3"].transform(
lambda x: x.groupby(level=1).transform(sorted, reverse=True)
)
mask = g["Var_3"].cumsum().eq(0)
print(df[mask])
Prints:
Var_1 Var_2 Var_3
PID Ses_ID
1 1 0.7 0.5 0
2 4 0.8 0.2 0
Upvotes: 2