Faster way to iterate through Pandas Dataframe?

Question

I have a list of strings, let's say:

fruit_list = ["apple", "banana", "coconut"]

And I have a pandas DataFrame, such as:

import pandas as pd

data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns = ['fruit_source', 'value'])

I want to populate a new column based on a text search of the existing column 'fruit_source'. For each row, the new column should hold whichever element of fruit_list is found in that row's 'fruit_source' value. One way of writing it is:

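df["fruit"] = "fruit not found"  # default when no fruit matches

for index, row in df.iterrows():
    for fruit in fruit_list:
        # compare in lowercase so "apple" matches "Apple farm"
        if fruit in row['fruit_source'].lower():
            df.loc[index, 'fruit'] = fruit
            break  # keep the first match instead of overwriting it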

This populates the DataFrame with a new column naming the fruit found in each fruit source.

When this is expanded to a larger DataFrame, though, the iteration becomes a performance problem: every row is checked against every element of the list, so the work grows with the number of rows times the length of the list.

Is there a more efficient way to do this?

Corralien · Accepted Answer

Use str.extract with a regex pattern to avoid a loop:

import re

pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
                                .fillna('fruit not found')

Output:

>>> df
    fruit_source  value            fruit
0     Apple farm     10            Apple
1   Banana field     15           Banana
2  Coconut beach     14          Coconut
3     corn field     10  fruit not found

>>> pattern
'(apple|banana|coconut)'
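
Note that the extracted value is the text as it appears in 'fruit_source' ("Apple"), not the element of fruit_list ("apple"). If the list element itself is wanted, a possible variant is sketched below. It assumes every entry in fruit_list is lowercase, and it adds re.escape as a precaution in case a name contains regex metacharacters; neither step is part of the answer above.

```python
import re

import pandas as pd

fruit_list = ["apple", "banana", "coconut"]
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns=['fruit_source', 'value'])

# Escape each name so characters such as "." or "+" are matched literally.
pattern = "(" + "|".join(re.escape(fruit) for fruit in fruit_list) + ")"

df['fruit'] = (
    df['fruit_source']
    # expand=False returns a Series for a single capture group
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    # assumption: fruit_list is all lowercase, so lowercasing recovers the list element
    .str.lower()
    .fillna('fruit not found')
)
```

With the sample data, the 'fruit' column then holds "apple", "banana", "coconut" and "fruit not found".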
