Reputation: 3751
In the example below, I can match a pandas column of text data against a collection of strings. The output only tells me whether a df.col1 cell contains any of the elements from the collection; it does not tell me which one.
That is exactly what I want: the string that was matched or, better still, its position in the collection array.
import re
import pandas as pd
words = ['dog', 'monkey']
pat = "|".join(map(re.escape, words))
df = pd.DataFrame({'col1':['lion bites dog','dog bites monkey','monkey bites man','man bites apple']})
df.loc[df.col1.str.contains(pat),'col1']
I need to know which string from the collection (words above) was matched because each element of the collection can be mapped to a numeric value, like:
words_dict = {'dog':'1', 'monkey':'2'}
I could perhaps try df.map(dict),
but in the actual case the collection is stored in a pandas DataFrame:
words_df = pd.DataFrame({1:['dog'], 2:['monkey']})
I can think of a rather circuitous solution that checks each element of the collection iteratively, but that seems highly inefficient when the number of elements in the collection is large.
Edit:
The desired output can be either:
[0,0,1,NaN] or ['dog','dog','monkey',False]
Upvotes: 2
Views: 2072
Reputation: 294258
concept 1
using sets
s = df.col1.str.split().apply(set)
s - (s - set(words))
0            {dog}
1    {monkey, dog}
2         {monkey}
3               {}
Name: col1, dtype: object
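The double subtraction above amounts to intersecting each row's set of tokens with words. A minimal sketch that spells the intersection out explicitly, assuming the df and words from the question (the names wanted and matched are only illustrative), so it does not rely on how pandas handles a plain Python set in Series arithmetic:

```python
import pandas as pd

words = ['dog', 'monkey']
df = pd.DataFrame({'col1': ['lion bites dog', 'dog bites monkey',
                            'monkey bites man', 'man bites apple']})

wanted = set(words)
# Split each cell on whitespace, then keep only the tokens found in `words`.
matched = df.col1.str.split().apply(lambda tokens: set(tokens) & wanted)
print(matched)
```

Like the original, this matches whole whitespace-separated tokens only, so 'dogs' would not count as a match for 'dog'.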
concept 2
using str.get_dummies
df.col1.str.get_dummies(sep=' ')[words]
   dog  monkey
0    1       0
1    1       1
2    0       1
3    0       0
Stretching this to get desired results
d1 = df.col1.str.get_dummies(sep=' ')
d2 = d1.loc[:, d1.columns.intersection(words)]
d2[d2.any(axis=1)].idxmax(axis=1).reindex(d2.index)
0       dog
1       dog
2    monkey
3       NaN
dtype: object
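Once the matched word is available, turning it into a position or a numeric code is a plain Series.map. A rough sketch, assuming the df, words and words_dict from the question; the names matched, position and codes are made up for illustration:

```python
import pandas as pd

words = ['dog', 'monkey']
words_dict = {'dog': '1', 'monkey': '2'}
df = pd.DataFrame({'col1': ['lion bites dog', 'dog bites monkey',
                            'monkey bites man', 'man bites apple']})

# Same steps as concept 2, with the axis passed by keyword.
d1 = df.col1.str.get_dummies(sep=' ')
d2 = d1.loc[:, d1.columns.intersection(words)]
matched = d2[d2.any(axis=1)].idxmax(axis=1).reindex(d2.index)

# Position of each matched word in `words`; rows without a match stay NaN.
position = matched.map({w: i for i, w in enumerate(words)})

# Or the numeric value the question associates with each word.
codes = matched.map(words_dict)
```

Note that when a row contains several of the words, idxmax returns the first matching column in d2, which follows the column order of the dummies frame rather than the word order inside the cell.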
concept 3
using numpy
import numpy as np
s = df.col1.str.split(expand=True).stack()
a = s.values[:, None] == [words]
pd.Series(np.where(a.any(1), a.argmax(1), np.nan), s.index).groupby(level=0).min()
0    0.0
1    0.0
2    1.0
3    NaN
dtype: float64
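As a possible alternative that is not one of the three concepts above, the regex pat from the question can be reused with str.extract, which returns the matched text instead of a boolean. This is only a sketch under the question's setup; the names first and position are illustrative:

```python
import re
import pandas as pd

words = ['dog', 'monkey']
pat = "|".join(map(re.escape, words))
df = pd.DataFrame({'col1': ['lion bites dog', 'dog bites monkey',
                            'monkey bites man', 'man bites apple']})

# Wrap the alternation in a capture group; extract returns the first
# match found in each cell, or NaN when nothing matches.
first = df.col1.str.extract('(' + pat + ')', expand=False)

# Map the matched word to its position in `words`.
position = first.map({w: i for i, w in enumerate(words)})
```

Unlike the token-based concepts, this behaves like str.contains and matches substrings, so 'dogma' would also produce 'dog'. When a cell contains several words, the one that appears first in the text is returned.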
Upvotes: 1