Reputation: 45
I want to search for a set of values in a particular column. If a match occurs, return the matching string. Currently I only get a true or false back. Here are the steps:
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
'Price': [22000,25000,27000,35000, 29000],
'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}
df = pd.DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])
search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
pattern = '|'.join(search_for_these_values)
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print(df)
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 True
1 Toyota Corolla 25000 XYZ 789 True
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 False
4 NaN 29000 DEF 456 False
I would like to have the following for column Match:
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 Honda
1 Toyota Corolla 25000 XYZ 789 Toy
2 Ford Focus 27000 CBA 321 Ford Focus
3 Audi A4 35000 ZYX 987
4 NaN 29000 DEF 456
Upvotes: 2
Views: 164
Reputation: 4263
Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
'Price': [22000,25000,27000,35000, 29000],
'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}
df = pd.DataFrame(Cars,columns= ['Brand', 'Price', 'Liscence Plate'])
print(df.head())
search = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
pattern = '|'.join(search)
#print (pattern)
df['Match']=df['Brand'].str.extract(r'(^'+pattern+')')
print(df.head())
Upvotes: 0
Reputation: 627537
You can use
pattern = r'({})'.format('|'.join(sorted(search_for_these_values, key=len, reverse=True)))
df['Match'] = df["Brand"].str.extract(pattern, expand=False)
Output:
>>> df
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 Honda
1 Toyota Corolla 25000 XYZ 789 Toy
2 Ford Focus 27000 CBA 321 Ford Focus
3 Audi A4 35000 ZYX 987 NaN
4 NaN 29000 DEF 456 NaN
Details:
sorted(search_for_these_values, key=len, reverse=True)
- since your keywords contain multiword entries, you need to first make sure the longer terms come before shorter ones in the resulting alternation pattern (since in NFA regex, the first alternative matched "wins" and the regex library stops searching for the rest of the alternatives at the current location)'|'.join(...)
- the alternation pattern is built from the sorted keywordsr'({})'.format(...)
- the alternations are enclosed with a capturing group that is necessary for Series.str.extract
to work correctly (it outputs the result only if there is at least one capturing group in the regex pattern).Upvotes: 3
Reputation: 153550
Try this using extract
with your pattern regex in a capture group:
df['Match'] = df['Brand'].str.extract(f'({pattern})')
Output:
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 Honda
1 Toyota Corolla 25000 XYZ 789 Toy
2 Ford Focus 27000 CBA 321 Ford Focus
3 Audi A4 35000 ZYX 987 NaN
4 NaN 29000 DEF 456 NaN
Upvotes: 2