user7298979

Reputation: 549

Faster way to iterate through Pandas Dataframe?

I have a list of strings, let's say:

fruit_list = ["apple", "banana", "coconut"]

And I have a Pandas DataFrame, such as:

import pandas as pd

data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns = ['fruit_source', 'value'])

I want to populate a new column based on a text search of the existing column 'fruit_source'. The new column should hold whichever element of fruit_list is found in that row's 'fruit_source' value. One way of writing it is:

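# Start with the fallback value; rows with a match are overwritten below
df["fruit"] = "fruit not found"

for index, row in df.iterrows():
    for fruit in fruit_list:
        # Compare in lowercase so that "apple" matches "Apple farm"
        if fruit in row['fruit_source'].lower():
            df.loc[index, 'fruit'] = fruit
            break  # stop at the first match so later fruits cannot overwrite it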

This populates the dataframe with a new column naming the fruit found in each fruit source.

When expanding this to a larger dataframe, though, the iteration becomes a performance problem. Every row is checked against every element of the list, so the work grows with the number of rows times the number of fruits.

Is there a more efficient way to do this?

Upvotes: 4

Views: 2806

Answers (2)

Corralien

Reputation: 120391

Use str.extract with a regex pattern to avoid a loop:

import re

pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
                                .fillna('fruit not found')

Output:

>>> df
    fruit_source  value            fruit
0     Apple farm     10            Apple
1   Banana field     15           Banana
2  Coconut beach     14          Coconut
3     corn field     10  fruit not found

>>> pattern
'(apple|banana|coconut)'
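
If the fruit names might contain regex metacharacters, or the new column should hold the list element itself ("apple") rather than the matched text ("Apple"), the same idea can be hardened a little. This is only a minimal sketch, assuming every entry in fruit_list is lowercase; extract_fruit is a hypothetical helper name, not a pandas function:

```python
import re

import pandas as pd


def extract_fruit(source, fruits, default="fruit not found"):
    """Return the leftmost fruit found in each string (hypothetical helper)."""
    # Escape each name so characters such as "." or "+" are matched literally.
    pattern = "(" + "|".join(re.escape(f) for f in fruits) + ")"
    # expand=False returns a Series rather than a one-column DataFrame.
    found = source.str.extract(pattern, flags=re.IGNORECASE, expand=False)
    # Lowercase the match so it equals the list entry (assumes lowercase entries).
    return found.str.lower().fillna(default)


fruit_list = ["apple", "banana", "coconut"]
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns=['fruit_source', 'value'])
df['fruit'] = extract_fruit(df['fruit_source'], fruit_list)
```

When a row contains several fruits, the regex keeps the one that appears first in the string.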

Upvotes: 5

AKX

Reputation: 168834

You can let Pandas do the work like so:

# Prime series with the "fruit not found" value
df['fruit'] = "fruit not found"
for fruit in fruit_list:
    # Generate boolean series of rows matching the fruit
    mask = df['fruit_source'].str.contains(fruit, case=False)
    # Replace those rows with the name of the fruit. Assigning through .loc
    # updates df itself; an in-place call on df['fruit'] may only change a copy.
    df.loc[mask, 'fruit'] = fruit

print(df) then prints:

    fruit_source  value            fruit
0     Apple farm     10            apple
1   Banana field     15           banana
2  Coconut beach     14          coconut
3     corn field     10  fruit not found
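
The per-fruit masks can also be built up front and combined in a single step. The following is a sketch of that variation using numpy.select, which the answer above does not use. One difference to be aware of: when a row matches several fruits, np.select keeps the first match in fruit_list, whereas the loop above keeps the last.

```python
import numpy as np
import pandas as pd

fruit_list = ["apple", "banana", "coconut"]
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns=['fruit_source', 'value'])

# One boolean mask per fruit; regex=False treats each name as a literal substring.
conditions = [
    df['fruit_source'].str.contains(fruit, case=False, regex=False)
    for fruit in fruit_list
]
# For each row, np.select takes the choice paired with the first True condition
# and falls back to the default when no condition is True.
df['fruit'] = np.select(conditions, fruit_list, default="fruit not found")
```

This still loops over fruit_list in Python, but each pass is a vectorized operation over the whole column, so the cost of the Python-level loop depends on the number of fruits rather than the number of rows.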

Upvotes: 6
