Samira Kumar
Samira Kumar

Reputation: 521

Pandas string search partial string

I've the following data

list = ['good dog','bad cat']

pattern = '|'.join(list)

|column|
|---|
|bad cat|
|good dog|
|cat|
|dog|

When I do the string contains in pandas, only the fully matched string gets True output as below

df[column].str.contains(pattern,regex=True)

|column|
|---|
|True|
|True|
|False|
|False|

Would it be possible to do something like fuzzy match where partial strings within a pattern are also checked for? So that output would all be true since "Cat" and "Dog" are partially present?

Thanks.

Upvotes: 1

Views: 199

Answers (1)

piRSquared
piRSquared

Reputation: 294258

Custom Metric

Write a crude fuzzy match metric. You can probably adjust this metric by removing high frequency words and stemming appropriately.

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]
    return min(c.max(0).mean(), c.max(1).mean())

This calculates how many words from one list matches how many words from another.

We build a dataframe to help illustrate.

d = pd.DataFrame([
    [fuzz(a, b) for b in map(str.split, lst)]
                for a in df.column.str.split()
], df.index, lst)

d

   good dog  bad cat
0       0.0      1.0
1       1.0      0.0
2       0.0      0.5
3       0.5      0.0

We can see that we get a metric of 1.0 for the first row and 'bad cat' and the second row and 'good dog'. For the third and fourth rows, we get measures of 0.5 meaning half the words matched.

Now you set a threshold and find if any in a row exceed the threshold:

For a threshold of .5

df[d.ge(.5).any(1)]

     column
0   bad cat
1  good dog
2       cat
3       dog

For a threshold of .6

df[d.ge(.6).any(1)]

     column
0   bad cat
1  good dog

Levenshtein

Use Levenshtein's distance ratio

import Levenshtein

c = pd.DataFrame([
    [Levenshtein.ratio(a, b) for b in lst]
    for a in df.column
], df.index, lst)

c

   good dog   bad cat
0  0.266667  1.000000
1  1.000000  0.266667
2  0.000000  0.600000
3  0.545455  0.200000

And you can do the same threshold analysis as above.

Upvotes: 1

Related Questions