I have a question about matching the strings in a list against a column in a DataFrame.
I read the question "Check if String in List of Strings is in Pandas DataFrame Column" and understand it, but my need is a little different.
Code:
import numpy as np
import pandas as pd

Cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4', np.nan],
'Price': [22000,25000,27000,35000, 29000],
'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}
df = pd.DataFrame(Cars, columns=['Brand', 'Price', 'Liscence Plate'])
search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
# Join the search terms into a single regex alternation
pattern = '|'.join(search_for_these_values)
# Substring match against Brand; NaN rows are treated as False
df['Match'] = df["Brand"].str.contains(pattern, na=False)
print(df)
Output I get:
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 True
1 Toyota Corolla 25000 XYZ 789 True
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 False
4 NaN 29000 DEF 456 False
Output I want:
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 True
1 Toyota Corolla 25000 XYZ 789 False
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 True
4 NaN 29000 DEF 456 False
Upvotes: 0
Views: 86
Reputation: 13841
If we apply the rule you outlined, "if one word matches, then True", literally, then any row whose Brand contains '2019' would also return True, which I believe is not what you want.
With that in mind, you can build a new list from the split() words of your search_for_these_values, excluding years, using a list comprehension, and then use isin with any:
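import re

# List comprehension: split each search phrase into words and drop four-digit years
s = [word for cars in search_for_these_values for word in cars.split() if not re.search(r'\d{4}', word)]
# Assign True / False: split Brand into one word per column and check whether any word is in s
# (axis must be passed by keyword; recent pandas versions reject the positional any(1))
df['Match'] = df['Brand'].str.split(expand=True).isin(s).any(axis=1)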
Prints back:
Brand Price Liscence Plate Match
0 Honda Civic 22000 ABC 123 True
1 Toyota Corolla 25000 XYZ 789 False
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 True
4 NaN 29000 DEF 456 False
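For reference, the same whole-word rule can be sketched without the intermediate split DataFrame, using a plain Python set. This is only an illustration under a few assumptions: the names `words` and `has_whole_word` are made up for the example, four-digit tokens are assumed to be years (as in the answer above), and the DataFrame is cut down to the Brand column from the question.

```python
import re

import numpy as np
import pandas as pd

# Sample data from the question (Brand column only)
df = pd.DataFrame({"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]})
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Split every search phrase into words and drop four-digit tokens (assumed to be years)
words = {
    word
    for phrase in search_for_these_values
    for word in phrase.split()
    if not re.fullmatch(r"\d{4}", word)
}


def has_whole_word(brand):
    """Return True if any whole word of `brand` is one of the search words."""
    # NaN (or any other non-string value) never matches
    if not isinstance(brand, str):
        return False
    return not words.isdisjoint(brand.split())


df["Match"] = df["Brand"].apply(has_whole_word)
print(df["Match"].tolist())  # [True, False, True, True, False]
```

Because the comparison is between whole words, 'Toy' no longer matches 'Toyota', while 'Audi' and 'A4' still match 'Audi A4'.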
Upvotes: 0
Reputation: 29742
One way is to match whole words using regex word boundaries (\b):
pat = "|".join(search_for_these_values).replace(" ", "|")
match = df["Brand"].str.findall(r"\b(%s)\b" % pat)
Output:
0 [Honda]
1 []
2 [Ford, Focus]
3 [Audi, A4]
4 NaN
Name: Brand, dtype: object
You can then assign it back as a boolean column (True where at least one word was found):
df["match"] = match.str.len().ge(1)
Final output:
Brand Price Liscence Plate match
0 Honda Civic 22000 ABC 123 True
1 Toyota Corolla 25000 XYZ 789 False
2 Ford Focus 27000 CBA 321 True
3 Audi A4 35000 ZYX 987 True
4 NaN 29000 DEF 456 False
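If only the True / False column is needed, a possible variant of the word-boundary idea is sketched below. It assumes the search terms might contain regex metacharacters, so each word is passed through re.escape, and it uses a non-capturing group so that str.contains can be called directly. The names `words` and `pat` are just for illustration, and the DataFrame is reduced to the Brand column from the question.

```python
import re

import numpy as np
import pandas as pd

# Sample data from the question (Brand column only)
df = pd.DataFrame({"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]})
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Break the search phrases into individual words (sorted only to make the pattern deterministic)
words = sorted({word for phrase in search_for_these_values for word in phrase.split()})

# Escape each word and wrap the alternation in word boundaries;
# (?:...) is non-capturing, so str.contains does not warn about match groups
pat = r"\b(?:%s)\b" % "|".join(re.escape(word) for word in words)

# na=False makes the NaN row come out as False instead of NaN
df["match"] = df["Brand"].str.contains(pat, na=False)
print(df["match"].tolist())  # [True, False, True, True, False]
```

Note that '2019' stays in the pattern here, so a Brand that contains 2019 as a whole word would also be flagged.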
Upvotes: 1