Massagran
Massagran

Reputation: 1821

pandas: create column conditioned on row containing a string in list

I have a dataframe with 20 columns, and 3 of those columns (always the same) may contain one or more of these strings ["fraction", "fractional", "1/x", "one fifth"].

I want to add a new column that says whether or not each row is "fractional" (in other words, contains one of those words). This column could have Y or N to indicate this.

I've tried to do it with iterrows, like so:

list_of_fractional_widgets = []

for index, row in df.iterrows():
    fractional_keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
    # use str to remove offending nan values
    xx = str(row["HeaderX"])
    yy = str(row["HeaderY"])
    zz = str(row["HeaderZ"])

    widget_data = [xx, yy, zz]

    for word in fractional_keywords:
        found = [True for x in widget_data if word in x]
        if len(found)>0:
            list_of_fractional_widgets.append('Y')
            break
    if len(found) ==0:
        list_of_fractional_widgets.append('N')

df['Fractional?'] = list_of_fractional_widgets

however, I'm trying to understand if there is a more pandas / numpy efficient way to do so. Something like:

np.where(df['HeaderX'].str.contains(fractional_keywords?)), True)

as described in this SO question, but using a list and different headers.

Upvotes: 0

Views: 459

Answers (1)

ALollz
ALollz

Reputation: 59579

Create a single pattern by joining all the words with '|'. Then we check the condition in each column separately using Series.str.contains and create a single mask using np.logical_or.reduce.

Sample Data

import pandas as pd
import numpy as np

keywords = ["fraction", "fractional", "1/x", "one fifth", "Fraction"]
np.random.seed(45)
df = pd.DataFrame(np.random.choice(keywords+list('abcdefghijklm'), (4,3)),
                  columns=['HeaderX', 'HeaderY', 'HeaderZ'])

Code

pat = '|'.join(keywords)

df['Fractional?'] = np.logical_or.reduce([df[col].str.contains(pat) 
                                          for col in ['HeaderX', 'HeaderY', 'HeaderZ']])

      HeaderX    HeaderY   HeaderZ  Fractional?
0           g  one fifth  fraction         True
1   one fifth   Fraction         k         True
2  fractional          j         d         True
3           j          d         h        False

As a bonus, Series.str.contains can accept a case=False argument to ignore case when matching so there is no need to separately specify both 'fraction' and 'Fraction' (or any arbitrary capitalization like 'FracTIOn').

Upvotes: 2

Related Questions