DrakeMurdoch
DrakeMurdoch

Reputation: 859

Get a subset of a dataframe where columns must contain substring (if not, ignore those rows)

There are a lot of questions similar to this, but none asking this specific question, and I can't figure out how to solve it.

Assume I have a dataframe df that looks like the one shown below, and assume each column can have other substrings (though, the substrings shown below are all unique to one column per row, eg if one column has 'Coin' in it, no other other column in that row can contain the substring 'Coin').

This example shows whole strings, but assume that a row with 'Coin' in it could actually be 'Coin (quarters)', such that it must search out the substring 'Coin' and not assume 'Coin' is the whole string.

>>> df

   1       2       3       4       cap
0  Coin    Bill    Charge          72
1  Coin                            13
2  Charge  Coin                     1
3  Bill    Other   Coin    Charge  57
4  Charge  Bill                    11
5  Coin    Charge                   8
6  Other   Bill    Charge   Coin   99
7  Charge                          45 
8  Charge  Coin                    15
9  Bill    Other   Coin            71

What I want to do is select rows if and only if columns contain 'Coin' and 'Charge' but none of the other unique substrings. Essentially, if and only if a column contains 'Coin' and a column contains 'Charge' and the rest of the columns DO NOT contain 'Bill' or 'Other' or they have an empty string.

So in the example above, the resulting dataframe would be a subset of df:

>>> df_res

   1       2       3       4       cap
2  Charge  Coin                     1
5  Coin    Charge                   8
8  Charge  Coin                    15

Upvotes: 1

Views: 67

Answers (1)

Andrej Kesely
Andrej Kesely

Reputation: 195408

Try:

x = df[
    df.apply(
        lambda x: "Coin" in (v := x.values)
        and "Charge" in v
        and "Bill" not in v
        and "Other" not in v,
        axis=1,
    )
]
print(x)

Prints:

        1       2 3 4  cap
2  Charge    Coin        1
5    Coin  Charge        8
8  Charge    Coin       15

EDIT: To search substrings:

x = df[
    df.apply(
        lambda x: any("Coin" in i for i in x if isinstance(i, str))
        and any("Charge" in i for i in x if isinstance(i, str))
        and not any("Bill" in i for i in x if isinstance(i, str))
        and not any("Other" in i for i in x if isinstance(i, str)),
        axis=1,
    )
]
print(x)

Prints:

        1               2 3 4  cap
2  Charge  Coin(quarters)        1
5    Coin          Charge        8
8  Charge            Coin       15

Upvotes: 2

Related Questions