Melly Donald
Melly Donald

Reputation: 45

Search pandas dataframe column for particular set of string, and then that string

I want to search for a set of values in a particular column. If a match occurs, return the matching string. Currently I only get a true or false back. Here are the steps:

  1. Create df:
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
    'Price': [22000,25000,27000,35000, 29000],
    'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

df = pd.DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])
  1. Search for particular set of values:
search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
pattern = '|'.join(search_for_these_values)
df['Match'] = df["Brand"].str.contains(pattern, na=False)
  1. Print df:
print(df)
Brand   Price           Liscence Plate      Match
0       Honda Civic     22000    ABC 123    True
1       Toyota Corolla  25000    XYZ 789    True
2       Ford Focus      27000    CBA 321    True
3       Audi A4         35000    ZYX 987    False
4       NaN             29000    DEF 456    False

I would like to have the following for column Match:

Brand   Price           Liscence Plate      Match
0       Honda Civic     22000    ABC 123    Honda
1       Toyota Corolla  25000    XYZ 789    Toy
2       Ford Focus      27000    CBA 321    Ford Focus
3       Audi A4         35000    ZYX 987    
4       NaN             29000    DEF 456    

Upvotes: 2

Views: 164

Answers (3)

 Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
'Price': [22000,25000,27000,35000, 29000],
'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

 df = pd.DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])
 print(df.head())

 search = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
 pattern = '|'.join(search)

 #print (pattern)
 df['Match']=df['Brand'].str.extract(r'(^'+pattern+')')
 print(df.head())

Upvotes: 0

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627537

You can use

pattern = r'({})'.format('|'.join(sorted(search_for_these_values, key=len, reverse=True)))
df['Match'] = df["Brand"].str.extract(pattern, expand=False)

Output:

>>> df
            Brand  Price Liscence Plate       Match
0     Honda Civic  22000        ABC 123       Honda
1  Toyota Corolla  25000        XYZ 789         Toy
2      Ford Focus  27000        CBA 321  Ford Focus
3         Audi A4  35000        ZYX 987         NaN
4             NaN  29000        DEF 456         NaN

Details:

  • sorted(search_for_these_values, key=len, reverse=True) - since your keywords contain multiword entries, you need to first make sure the longer terms come before shorter ones in the resulting alternation pattern (since in NFA regex, the first alternative matched "wins" and the regex library stops searching for the rest of the alternatives at the current location)
  • '|'.join(...) - the alternation pattern is built from the sorted keywords
  • r'({})'.format(...) - the alternations are enclosed with a capturing group that is necessary for Series.str.extract to work correctly (it outputs the result only if there is at least one capturing group in the regex pattern).

Upvotes: 3

Scott Boston
Scott Boston

Reputation: 153550

Try this using extract with your pattern regex in a capture group:

df['Match'] = df['Brand'].str.extract(f'({pattern})')

Output:

            Brand  Price Liscence Plate       Match
0     Honda Civic  22000        ABC 123       Honda
1  Toyota Corolla  25000        XYZ 789         Toy
2      Ford Focus  27000        CBA 321  Ford Focus
3         Audi A4  35000        ZYX 987         NaN
4             NaN  29000        DEF 456         NaN

Upvotes: 2

Related Questions