Reputation: 521
I have the following data:
lst = ['good dog', 'bad cat']
pattern = '|'.join(lst)
|column|
|---|
|bad cat|
|good dog|
|cat|
|dog|
When I run a string-contains check in pandas, only the fully matched strings get a `True` output, as below:
df[column].str.contains(pattern,regex=True)
|column|
|---|
|True|
|True|
|False|
|False|
Would it be possible to do something like a fuzzy match, where partial matches within a pattern are also checked for? Then the output would be all `True`, since "cat" and "dog" are partially present.
Thanks.
Upvotes: 1
Views: 199
Reputation: 294258
Write a crude fuzzy-match metric. You can probably improve this metric by removing high-frequency words and stemming appropriately.
import numpy as np

def fuzz(a, b):
    # Fraction of words two token lists share, taking the
    # smaller of the two directional overlaps.
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]  # pairwise word-equality matrix
    return min(c.max(0).mean(), c.max(1).mean())
This calculates the fraction of words from one list that match words from the other, taking the smaller of the two directions.
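As a quick sanity check (an added example, not part of the original answer), comparing the single word 'cat' against the pattern 'bad cat' reproduces the 0.5 seen in the table below:

```python
import numpy as np

def fuzz(a, b):
    # Fraction of shared words, taking the smaller of the
    # two directional overlaps between the token lists.
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]  # pairwise word-equality matrix
    return min(c.max(0).mean(), c.max(1).mean())

print(fuzz('cat'.split(), 'bad cat'.split()))      # 0.5: one of two pattern words matched
print(fuzz('bad cat'.split(), 'bad cat'.split()))  # 1.0: all words matched
```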
We build a dataframe to help illustrate.
d = pd.DataFrame([
[fuzz(a, b) for b in map(str.split, lst)]
for a in df.column.str.split()
], df.index, lst)
d
good dog bad cat
0 0.0 1.0
1 1.0 0.0
2 0.0 0.5
3 0.5 0.0
We can see that we get a metric of 1.0 for the first row with 'bad cat' and for the second row with 'good dog'. For the third and fourth rows, we get measures of 0.5, meaning half the words matched.
Now you set a threshold and keep the rows where any pattern's metric meets it.
For a threshold of .5:
df[d.ge(.5).any(axis=1)]
column
0 bad cat
1 good dog
2 cat
3 dog
For a threshold of .6:
df[d.ge(.6).any(axis=1)]
column
0 bad cat
1 good dog
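As a further sketch (my addition, with a hypothetical `best` variable): beyond filtering, `idxmax` can recover which pattern each row matched best, masked by the same threshold:

```python
import numpy as np
import pandas as pd

# Rebuild the data and the fuzzy-metric frame from the answer above
df = pd.DataFrame({'column': ['bad cat', 'good dog', 'cat', 'dog']})
lst = ['good dog', 'bad cat']

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]
    return min(c.max(0).mean(), c.max(1).mean())

d = pd.DataFrame([
    [fuzz(a, b) for b in map(str.split, lst)]
    for a in df.column.str.split()
], df.index, lst)

# Best-matching pattern per row, NaN where nothing meets the threshold
best = d.idxmax(axis=1).where(d.max(axis=1) >= .5)
print(best)
```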
Alternatively, use the Levenshtein distance ratio (from the python-Levenshtein package):
import Levenshtein
c = pd.DataFrame([
[Levenshtein.ratio(a, b) for b in lst]
for a in df.column
], df.index, lst)
c
good dog bad cat
0 0.266667 1.000000
1 1.000000 0.266667
2 0.000000 0.600000
3 0.545455 0.200000
And you can do the same threshold analysis as above.
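If installing python-Levenshtein isn't an option, the standard library's `difflib.SequenceMatcher` offers a similar character-level ratio (a stand-in for the answer's method, not part of it; its numbers can differ from `Levenshtein.ratio` on other inputs):

```python
from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame({'column': ['bad cat', 'good dog', 'cat', 'dog']})
lst = ['good dog', 'bad cat']

# ratio() = 2*M / T, where M is the number of matching characters
# and T the combined length of both strings
c = pd.DataFrame([
    [SequenceMatcher(None, a, b).ratio() for b in lst]
    for a in df.column
], df.index, lst)
print(c)
```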
Upvotes: 1