Tatiana Goretskaya

Reputation: 576

Filter dataframe by a list of possible prefixes for specific column

What I'm trying to do is:

options = ['abc', 'def']
df[any(df['a'].str.startswith(start) for start in options)]

I want to apply a filter so I only have entries that have values in the column 'a' starting with one of the given options.

The following code works, but I need it to work with several prefix options:

start = 'abc'
df[df['a'].str.startswith(start)]

The error message is

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I read "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" but still don't understand how to apply it here.

Upvotes: 5

Views: 6320

Answers (3)

Tatiana Goretskaya

Reputation: 576

One more solution:

# extract all possible values for 'a' column
all_a_values = df['a'].unique()
# filter 'a' column values by my criteria
accepted_a_values = [x for x in all_a_values if any([str(x).startswith(prefix) for prefix in options])]
# apply filter
df = df[df['a'].isin(accepted_a_values)]

Took it from here: remove rows and ValueError Arrays were different lengths

The solution provided by @Vaishali is the simplest and most logical, but I also needed the accepted_a_values list to iterate through. This was not mentioned in the question, so I marked her answer as correct.
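For anyone who needs both the list of matching values and the filtered frame, here is a minimal sketch under the assumption that column 'a' holds strings. The sample data and the names prefixes and filtered are made up for illustration; the built-in str.startswith also accepts a tuple, so the inner loop over options can be dropped.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({'a': ['abcd', 'def5', 'xabc', 'abcd', '9def']})
options = ['abc', 'def']

# Built-in str.startswith accepts a tuple of prefixes
prefixes = tuple(options)

# Unique values of 'a' that start with any of the prefixes
accepted_a_values = [x for x in df['a'].unique() if str(x).startswith(prefixes)]

# Keep only the rows whose 'a' value is in the accepted list
filtered = df[df['a'].isin(accepted_a_values)]
```

The accepted_a_values list can then be iterated over separately, while filtered holds the matching rows.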

Upvotes: 0

Vaishali

Reputation: 38415

You can pass a tuple of options to startswith

df = pd.DataFrame({'a': ['abcd', 'def5', 'xabc', '5abc1', '9def', 'defabcb']})
options = ['abc', 'def']
df[df.a.str.startswith(tuple(options))]

You get

    a
0   abcd
1   def5
5   defabcb
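If the column can contain missing values, the mask may come back with NaN entries, and boolean indexing then fails. A hedged sketch of one way around that, assuming you want missing values treated as non-matching: pass na=False to str.startswith. The None entry in the sample data below is an assumption added for illustration.

```python
import pandas as pd

# Same idea as above, with one missing value added for illustration
df = pd.DataFrame({'a': ['abcd', 'def5', None, 'xabc', 'defabcb']})
options = ['abc', 'def']

# na=False marks missing values as non-matching, so the mask stays boolean
mask = df['a'].str.startswith(tuple(options), na=False)
result = df[mask]
```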

Upvotes: 6

taras

Reputation: 6915

You can try this:

mask = np.array([df['a'].str.startswith(start) for start in options]).any(axis=0)
df[mask]

It creates a boolean Series for each prefix in options, stacks them into a 2-D array with one row per prefix, and applies any across the prefixes (axis=0), which yields one boolean per DataFrame row.

You were getting the error because the built-in any evaluates the truth value of each item it iterates over, and here each item is a whole Series. As the error message says, the truth value of a Series is ambiguous, so you need an array-aware any instead.
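To make the shapes concrete, here is a small self-contained sketch of the stacking approach, reusing the sample data from the other answer. The names stacked and filtered are only for illustration. Note that the reduction has to run over axis 0 (the prefixes), so the mask ends up with one entry per row of df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['abcd', 'def5', 'xabc', '5abc1', '9def', 'defabcb']})
options = ['abc', 'def']

# One boolean Series per prefix, stacked into shape (len(options), len(df))
stacked = np.array([df['a'].str.startswith(start) for start in options])

# Reduce across the prefixes (axis 0) to get one boolean per DataFrame row
mask = stacked.any(axis=0)

# A NumPy boolean array of length len(df) can index the frame directly
filtered = df[mask]
```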

Upvotes: 2
