Reputation: 4018
I am dealing with a pandas dataframe that looks like this:
0 Item Category
1 Hammer A
2 Car A
3 Cardiologist B
4 Park A
5 Parkway A
6 CarparkCar A
I need to find all cells in Item
that contain a specific sequence of letters, defined as substring of my reference word. I am not interested in cells that contain additional letters not present in my reference word.
reference_word = Carpark
Desired output:
0 Item Category Contains_substring_of_reference_word
1 Hammer A FALSE
2 Car A TRUE
3 Cardiologist B FALSE
4 Park A TRUE
5 Parkway A FALSE
6 CarparkCar A TRUE
How to check which cell of a pandas dataframe column contains a substring of a given word/string?
Upvotes: 1
Views: 1427
Reputation: 294488
You can create sets with a comprehension and use <=
to signify subset.
sets = np.array([set(x.lower()) for x in df.Item.values.tolist()])
df.assign(Bool=sets <= set('carpark'))
Item Category Bool
0
1 Hammer A False
2 Car A True
3 Cardiologist B False
4 Park A True
5 Parkway A False
6 CarparkCar A True
Upvotes: 1
Reputation: 402854
Option 1
One simple way to do this is using apply
+ set.issubset
-
v = df.Item.str.lower().apply(lambda x: set(x).issubset('carpark'))
v
0 False
1 True
2 False
3 True
4 False
5 True
Name: Item, dtype: bool
Assign the result back -
df['Contains_substring_of_reference_word'] = v
df
Item Category Contains_substring_of_reference_word
0 Hammer A False
1 Car A True
2 Cardiologist B False
3 Park A True
4 Parkway A False
5 CarparkCar A True
Option 2
Another solution using set.difference
operations -
(df.Item.str.lower().apply(set) - set('carpark')).str.len() == 0
0 False
1 True
2 False
3 True
4 False
5 True
Name: Item, dtype: bool
Option 3
Adding another option with np.vectorize
, this should be faster.
c = set('carpark')
def foo(x):
return c.issuperset(x.lower())
v = np.vectorize(foo)
v(df.Item)
array([False, True, False, True, False, True], dtype=bool)
Timings
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df.Item.str.lower().apply(lambda x: set(x).issubset('carpark'))
1 loop, best of 3: 927 ms per loop
%timeit (df.Item.str.lower().apply(set) - set('carpark')).str.len() == 0
1 loop, best of 3: 1.13 s per loop
%timeit v(df.Item)
1 loop, best of 3: 497 ms per loop
Upvotes: 3