eliss

Reputation: 89

Filtering dataframe groups with "or" condition

I am dealing with a dataframe such as this one:

        id  Xp_1  Xp_2  Xp_4  Xt_1  Xt_2  Xt_3  Mp_1  Mp_2  Mp_3  Mt_1  Mt_2  Mt_6
    0  i24   NaN  0.27   NaN  0.45  0.20  0.25  0.27   NaN   NaN   NaN   NaN   NaN
    1  i25  0.45  0.47  0.46  0.22  0.42   NaN  0.42  0.05  0.43  0.12  0.01  0.04
    2  i11   NaN   NaN  0.32  0.14  0.32  0.35  0.29  0.33   NaN   NaN  0.02  0.44
    3  i47   NaN  0.56  0.59  0.92   NaN  0.56  0.51  0.12   NaN  0.10  0.10   NaN
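For reference, the sample frame can be rebuilt like this (the `NaN` entries in the printout are missing values, i.e. `np.nan`; the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

nan = np.nan
cols = ["Xp_1", "Xp_2", "Xp_4", "Xt_1", "Xt_2", "Xt_3",
        "Mp_1", "Mp_2", "Mp_3", "Mt_1", "Mt_2", "Mt_6"]
data = [
    ["i24",  nan, 0.27,  nan, 0.45, 0.20, 0.25, 0.27,  nan,  nan,  nan,  nan,  nan],
    ["i25", 0.45, 0.47, 0.46, 0.22, 0.42,  nan, 0.42, 0.05, 0.43, 0.12, 0.01, 0.04],
    ["i11",  nan,  nan, 0.32, 0.14, 0.32, 0.35, 0.29, 0.33,  nan,  nan, 0.02, 0.44],
    ["i47",  nan, 0.56, 0.59, 0.92,  nan, 0.56, 0.51, 0.12,  nan, 0.10, 0.10,  nan],
]
df = pd.DataFrame(data, columns=["id"] + cols)
```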

As you can see, I have two macro-groups (X and M), and within each macro-group two subsets (p and t). What I would like to implement is an "or" condition between the two macro-groups and an "and" condition between the subsets within each macro-group.

Basically, I'd like to keep the rows that have at least two values in each subset of at least one macro-group. For example, i24 should be discarded: we only have one value for the Xp subset, and the M group has just one value in Mp and none in Mt. Entries like i11 should be kept: the condition is not satisfied for the X group, but it is for M. The same goes for i25, which satisfies the condition in both groups.

I tried this:

keep_r = (df.groupby(lambda col: col.split("_", maxsplit=1)[0], axis=1)
            .count()
            .ge(2)
            .all(axis=1))
df = df.loc[keep_r]

but it checks whether every subset (Xp, Xt, Mp, Mt) has at least two values. Instead, I want to treat X and M independently.

Thank you!

Upvotes: 1

Views: 71

Answers (2)

Henry Ecker

Reputation: 35676

IIUC, try creating a MultiIndex from the column names with a str.extract pattern:

df = df.set_index('id')
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r'(.)(.)_(.+)'))

0       X                                   M
1       p                 t                 p                 t
2       1     2     4     1     2     3     1     2     3     1     2     6
id
i24   NaN  0.27   NaN  0.45  0.20  0.25  0.27   NaN   NaN   NaN   NaN   NaN
i25  0.45  0.47  0.46  0.22  0.42   NaN  0.42  0.05  0.43  0.12  0.01  0.04
i11   NaN   NaN  0.32  0.14  0.32  0.35  0.29  0.33   NaN   NaN  0.02  0.44
i47   NaN  0.56  0.59  0.92   NaN  0.56  0.51  0.12   NaN  0.10  0.10   NaN

Then groupby levels 0 and 1 to count, and apply separate logic to each level:

keep = (
    df.groupby(axis=1, level=[0, 1]).count()
        .ge(2).all(axis=1, level=0).any(axis=1)
)

id
i24    False
i25     True
i11     True
i47     True
dtype: bool

Then filter down and collapse the MultiIndex:

df = df.loc[keep]
df.columns = df.columns.map(lambda c: f'{"".join(c[:-1])}_{c[-1]}')
df = df.reset_index()

    id  Xp_1  Xp_2  Xp_4  Xt_1  Xt_2  Xt_3  Mp_1  Mp_2  Mp_3  Mt_1  Mt_2  Mt_6
0  i25  0.45  0.47  0.46  0.22  0.42   NaN  0.42  0.05  0.43  0.12  0.01  0.04
1  i11   NaN   NaN  0.32  0.14  0.32  0.35  0.29  0.33   NaN   NaN  0.02  0.44
2  i47   NaN  0.56  0.59  0.92   NaN  0.56  0.51  0.12   NaN  0.10  0.10   NaN
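A caveat: the `level=` argument of `DataFrame.all` was deprecated in pandas 1.3 and removed in 2.0, and `groupby(..., axis=1)` is deprecated in recent versions too. A sketch of the same pipeline that avoids both by transposing and grouping on index levels (the sample frame is rebuilt inline so the snippet stands alone):

```python
import numpy as np
import pandas as pd

nan = np.nan
cols = ["Xp_1", "Xp_2", "Xp_4", "Xt_1", "Xt_2", "Xt_3",
        "Mp_1", "Mp_2", "Mp_3", "Mt_1", "Mt_2", "Mt_6"]
data = [[nan, 0.27, nan, 0.45, 0.20, 0.25, 0.27, nan, nan, nan, nan, nan],
        [0.45, 0.47, 0.46, 0.22, 0.42, nan, 0.42, 0.05, 0.43, 0.12, 0.01, 0.04],
        [nan, nan, 0.32, 0.14, 0.32, 0.35, 0.29, 0.33, nan, nan, 0.02, 0.44],
        [nan, 0.56, 0.59, 0.92, nan, 0.56, 0.51, 0.12, nan, 0.10, 0.10, nan]]
df = pd.DataFrame(data, columns=cols,
                  index=pd.Index(["i24", "i25", "i11", "i47"], name="id"))

# same MultiIndex split as above: macro-group, subset, number
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(r"(.)(.)_(.+)"))

# transpose, group rows by (macro-group, subset), count non-NaN values per id
counts = df.T.groupby(level=[0, 1]).count().T

# "and" within each macro-group, then "or" across macro-groups
keep = counts.ge(2).T.groupby(level=0).all().T.any(axis=1)
df = df.loc[keep]
```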

Upvotes: 1

Mustafa Aydın

Reputation: 18315

We can groupby over two keys at once: X/M and p/t, i.e. the first and second characters of the column names. Then we can invoke your .count().ge(2).all(axis=1) logic, but with the all applied over the p/t level only. Finally, we express the "or" condition via any:

# to keep the `id` column aside
df = df.set_index("id")

# groups
c = df.columns
g = df.groupby([c.str[0], c.str[1]], axis=1)

# boolean mask
mask = (g.count()
         .ge(2)
         .all(axis=1, level=0)     # subsets: "and"
         .any(axis=1))             # macro-groups: "or"

# new df
ndf = df[mask]

to get

>>> ndf

     Xp_1  Xp_2  Xp_4  Xt_1  Xt_2  Xt_3  Mp_1  Mp_2  Mp_3  Mt_1  Mt_2  Mt_6
id
i25  0.45  0.47  0.46  0.22  0.42   NaN  0.42  0.05  0.43  0.12  0.01  0.04
i11   NaN   NaN  0.32  0.14  0.32  0.35  0.29  0.33   NaN   NaN  0.02  0.44
i47   NaN  0.56  0.59  0.92   NaN  0.56  0.51  0.12   NaN  0.10  0.10   NaN

For illustration, before invoking all and any, we had:

>>> g.count().ge(2)

         M             X
         p      t      p     t
id
i24  False  False  False  True
i25   True   True   True  True
i11   True   True  False  True
i47   True   True   True  True

Then all over level 0, i.e. over the p/t level, reduces this one step with "and" logic:

>>> g.count().ge(2).all(axis=1, level=0)

         M      X
id
i24  False  False
i25   True   True
i11   True  False
i47   True   True

and finally, any over the remaining M/X level reduces it to a boolean Series with "or" logic, which tells us which rows to keep:

>>> g.count().ge(2).all(axis=1, level=0).any(axis=1)

id
i24    False
i25     True
i11     True
i47     True
dtype: bool
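As with the other answer, `all(axis=1, level=0)` and `groupby(..., axis=1)` are deprecated or removed in newer pandas. A version-agnostic sketch of the same mask, selecting each (macro-group, subset) pair with `filter(regex=...)` instead (the sample frame is rebuilt inline, and the `subset_ok` name is just for illustration):

```python
import numpy as np
import pandas as pd

nan = np.nan
cols = ["Xp_1", "Xp_2", "Xp_4", "Xt_1", "Xt_2", "Xt_3",
        "Mp_1", "Mp_2", "Mp_3", "Mt_1", "Mt_2", "Mt_6"]
data = [[nan, 0.27, nan, 0.45, 0.20, 0.25, 0.27, nan, nan, nan, nan, nan],
        [0.45, 0.47, 0.46, 0.22, 0.42, nan, 0.42, 0.05, 0.43, 0.12, 0.01, 0.04],
        [nan, nan, 0.32, 0.14, 0.32, 0.35, 0.29, 0.33, nan, nan, 0.02, 0.44],
        [nan, 0.56, 0.59, 0.92, nan, 0.56, 0.51, 0.12, nan, 0.10, 0.10, nan]]
df = pd.DataFrame(data, columns=cols,
                  index=pd.Index(["i24", "i25", "i11", "i47"], name="id"))

# one boolean Series per (macro-group, subset) pair:
# at least two non-NaN values in that row
subset_ok = pd.concat(
    {(g, s): df.filter(regex=rf"^{g}{s}_").notna().sum(axis=1).ge(2)
     for g in "XM" for s in "pt"},
    axis=1,
)

# "and" within each macro-group, then "or" across macro-groups
mask = subset_ok.T.groupby(level=0).all().T.any(axis=1)
ndf = df[mask]
```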

Upvotes: 3
