Siraj S.

Reputation: 3751

pandas str.contains match against multiple strings and get the matched values

In the example below, I can match a pandas column of text against a collection of strings. The output only tells me whether a df.col1 cell contains any element of the collection; it does not tell me which one. That is exactly what I want: the matched string or, better still, its position in the collection.

import re
import pandas as pd

words = ['dog', 'monkey']
# escape each word so it is matched literally, then OR them together
pat = "|".join(map(re.escape, words))

df = pd.DataFrame({'col1':['lion bites dog','dog bites monkey','monkey bites man','man bites apple']})
df.loc[df.col1.str.contains(pat),'col1']

I need to know which string from the collection (words above) was matched because each element of the collection can be mapped to a numeric value, like:

words_dict = {'dog':'1', 'monkey':'2'}

I could perhaps try something like Series.map(words_dict), but in the actual case the collection is stored in a pandas DataFrame:

words_df = pd.DataFrame({1:['dog'], 2:['monkey']})

I can think of a rather circuitous solution that checks each element of the collection iteratively, but that seems highly inefficient when the collection is large.

Edit:

The desired output can be either:

[0,0,1,NaN] or ['dog','dog','monkey',False]

Upvotes: 2

Views: 2072

Answers (1)

piRSquared

Reputation: 294258

concept 1
using sets

s = df.col1.str.split().apply(set)

s - (s - set(words))

0            {dog}
1    {monkey, dog}
2         {monkey}
3               {}
Name: col1, dtype: object
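A self-contained sketch of the same idea, written as an explicit set intersection instead of the double difference. The names word_set and matched are illustrative and not part of the original answer; the sketch assumes whitespace-separated tokens, as in the sample data.

```python
import pandas as pd

words = ['dog', 'monkey']
df = pd.DataFrame({'col1': ['lion bites dog', 'dog bites monkey',
                            'monkey bites man', 'man bites apple']})

# hypothetical helper name: the collection as a set for fast membership tests
word_set = set(words)

# split each cell into tokens, then keep only the tokens found in the collection
matched = df['col1'].str.split().apply(lambda tokens: word_set.intersection(tokens))
print(matched)
```

Each cell becomes the set of collection words it contains, so a row with no match ends up as an empty set rather than NaN.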

concept 2
using str.get_dummies

df.col1.str.get_dummies(sep=' ')[words]

   dog  monkey
0    1       0
1    1       1
2    0       1
3    0       0

Stretching this to get desired results

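d1 = df.col1.str.get_dummies(sep=' ')
d2 = d1.loc[:, d1.columns.intersection(words)]
# keep rows with at least one hit, take the first matching column, then restore the full index
d2[d2.any(axis=1)].idxmax(axis=1).reindex(d2.index)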

0       dog
1       dog
2    monkey
3       NaN
dtype: object
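To go from the matched word to its position in the collection, one possible variation is sketched below. It is not the answer's exact code: it assumes reindex(columns=words, fill_value=0) is acceptable for ordering the indicator columns like words, and the names hit, first and position are made up for illustration.

```python
import pandas as pd

words = ['dog', 'monkey']
df = pd.DataFrame({'col1': ['lion bites dog', 'dog bites monkey',
                            'monkey bites man', 'man bites apple']})

d1 = df['col1'].str.get_dummies(sep=' ')
# order the indicator columns like `words`; words absent from the text become all-zero columns
d2 = d1.reindex(columns=words, fill_value=0)

hit = d2.any(axis=1)                     # True where at least one word matched
first = d2.idxmax(axis=1).where(hit)     # first matching word in `words` order, NaN if none
position = first.map({w: i for i, w in enumerate(words)})
print(position)
```

When a row contains several words from the collection, idxmax returns the first column holding the maximum, so the word that comes earliest in words wins.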

concept 3
using numpy

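import numpy as np

# one row per token, indexed by (original row, token position)
s = df.col1.str.split(expand=True).stack()
# compare every token with every word: boolean array of shape (n_tokens, n_words)
a = s.values[:, None] == [words]

# position of the matched word per token (NaN if none), then the smallest position per original row
pd.Series(np.where(a.any(1), a.argmax(1), np.nan), s.index).groupby(level=0).min()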

0    0.0
1    0.0
2    1.0
3    NaN
dtype: float64
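Since the question starts from a regex pattern and keeps the collection in words_df, another option may be to capture the match with str.extract and look it up. This is only a sketch under assumptions: it treats the column labels of words_df (1 and 2) as the numeric values, and the names lookup, matched and value are hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': ['lion bites dog', 'dog bites monkey',
                            'monkey bites man', 'man bites apple']})
words_df = pd.DataFrame({1: ['dog'], 2: ['monkey']})

# assumption: the single row holds the words and the column labels are the numeric values
lookup = {word: label for label, word in words_df.iloc[0].items()}

# a capturing group makes str.extract return the first word found in each cell
pat = '(' + '|'.join(map(re.escape, lookup)) + ')'
matched = df['col1'].str.extract(pat, expand=False)
value = matched.map(lookup)
print(pd.DataFrame({'matched': matched, 'value': value}))
```

Unlike concepts 2 and 3, this returns the word that appears first in the text rather than the one that comes first in the collection, and it matches substrings (for example 'dog' inside 'dogma') unless the pattern is wrapped in word boundaries.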

Upvotes: 1
