user16253345

Reputation:

Find similarities using NLP/spaCy

I have a large dataframe that I need to compare against a reference dataframe in order to correct its id column. I'll illustrate my problem with this simple example.

import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

df = pd.DataFrame({'id':['nan','nan','nan'],
                    'description':['JOHN HAS 25 YEAR OLD LIVES IN At/12','STEVE has  50 OLD LIVES IN At.14','ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)

df1 = pd.DataFrame({'id':[1203,1205,1045],
                    'description':['JOHN HAS 25year OLD LIVES IN At 2','STEVE has  50year OLD LIVES IN At 14','ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)
age = ["50year", "25year", "10year"]
for a in age:
    ruler.add_patterns([{"label": "age", "pattern": a}])

names = ["JOHN", "STEVE", "ALICIA"]
for n in names:
    ruler.add_patterns([{"label": "name", "pattern": n}])

ref = ["AT 2", "At 13", "At 14"]
for r in ref:
    ruler.add_patterns([{"label": "ref", "pattern": r}])
# example: check which entities are recognized in a slightly different text
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12")
for ent in doc.ents:
    print(ent, ent.label_)

There is a difference between the text in the two dataframes, df and the reference df1. I don't know how to reach a 100% match in this case. I tried to use spaCy, but I don't know how to handle the differences and correct the id in df.

This is my dataframe (df):

   id                           description
0  nan      STEVE has 50 OLD LIVES IN At.14
1  nan   JOHN HAS 25 YEAR OLD LIVES IN At/12
2  nan  ALICIE HAS 10 YEAR OLD LIVES IN AT15

This is my reference dataframe (df1):

     id                           description
0  1203   STEVEN HAS 25year OLD lives in At 6
1  1205     JOHN HAS 25year OLD LIVES IN At 2
2  1045  ALICIE HAS 50year OLD LIVES IN At 13
3  3045   STEVE HAS 50year OLD LIVES IN At 14
4  3465  ALICIE HAS 10year OLD LIVES IN At 13

My expected output:

     id                           description
0  3045       STEVE has 50 OLD LIVES IN At.14
1  1205   JOHN HAS 25 YEAR OLD LIVES IN At/12
2  3465  ALICIE HAS 10 YEAR OLD LIVES IN AT15

NB: the sentences are not in the same order, and the two dataframes do not have the same length.

Upvotes: 4

Views: 989

Answers (4)

anon

Reputation:

If the batch is very large (and because fuzzywuzzy is slow), we can build a KNN index with NMSLIB over TF-IDF vectors of character n-grams (idea lifted from this article and this follow-up):

import re
import pandas as pd
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(description, n=3):
    # lowercase and strip punctuation, then build character n-grams
    description = description.lower()
    description = re.sub(r'[,-./]', r'', description)
    grams = zip(*[description[i:] for i in range(n)])
    return [''.join(g) for g in grams]

def build_index(descriptions, vectorizer):
    ref_vectors = vectorizer.fit_transform(descriptions)
    # HNSW: approximate nearest-neighbour search over sparse cosine space
    index = nmslib.init(method='hnsw',
                        space='cosinesimil_sparse',
                        data_type=nmslib.DataType.SPARSE_VECTOR)
    index.addDataPointBatch(ref_vectors)
    index.createIndex()
    return index

def search_index(queries, vectorizer, index):
    query_vectors = vectorizer.transform(queries)
    results = index.knnQueryBatch(query_vectors, k=1)  # nearest neighbour only
    return [res[0][0] for res in results]  # position of the best match in the index

ref_df, query_df = df1, df  # df1 holds the correct ids; df needs them filled in
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)

index = build_index(ref_df['description'], vectorizer)

results = search_index(query_df['description'], vectorizer, index)

query_ids = [ref_df['id'].iloc[ref_idx] for ref_idx in results]
query_df['id'] = query_ids
print(query_df)

This gives:

     id                           description
0  3045      STEVE has  50 OLD LIVES IN At.14
1  1205   JOHN HAS 25 YEAR OLD LIVES IN At/12
2  3465  ALICIE HAS 10 YEAR OLD LIVES IN AT13

We can do more preprocessing in ngrams, e.g. stop words, handling symbols, etc.
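For instance, here is a sketch of a more aggressive ngrams with stop-word filtering; the STOP_WORDS set is a made-up illustration, not part of the original approach:

import re

STOP_WORDS = {"has", "in", "old"}  # illustrative only; tune for your data

def ngrams(description, n=3):
    description = description.lower()
    # keep only letters, digits and spaces
    description = re.sub(r'[^a-z0-9 ]', ' ', description)
    # drop stop words before building the character n-grams
    tokens = [t for t in description.split() if t not in STOP_WORDS]
    description = ' '.join(tokens)
    grams = zip(*[description[i:] for i in range(n)])
    return [''.join(g) for g in grams]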

Upvotes: 2

shyamcody
shyamcody

Reputation: 46

I think spacy is not the right tool here. What you need is (1) regex and (2) a Jaccard match. Since most of your tokens are supposed to match exactly, the Jaccard match, which measures how many words two sentences have in common, should work well. For the regex part, I would use the following cleaning function:

import re

def text_clean(text):
    # keep only letters; replace digits and punctuation with spaces
    text = re.sub('[^A-Za-z]', ' ', text)
    text = text.lower()
    return text

The above function, applied to all the strings, will remove the digits and characters such as '.' and '/'. After that, applying the Jaccard similarity should give good matches.
I have suggested removing the digits because in one of your examples /12 matched 2, which means you are mainly concerned with the words, not the exact digits.
You may not get a 100% Jaccard score on every correct pair, so you will have to put a cut-off on the Jaccard value above which you consider two strings a match; a sketch follows below.
You may also want a more complex approach that combines spacy and the Jaccard match on the cleaned strings, with custom cut-offs on both scores, and pick your matches from that.
Also, I noted that in some cases two tokens occur fused together. Does that only happen with digits, as in At13, or with two words as well? To use the Jaccard match efficiently you will need to resolve that too, but that's a whole other process and a bit out of scope for this answer.
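A minimal sketch of the clean-then-Jaccard idea, reusing the text_clean above; the 0.6 cut-off is an arbitrary example value, not a tuned one:

def jaccard(a, b):
    # word-level Jaccard similarity between two cleaned strings
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

CUTOFF = 0.6  # example threshold; tune on real data

for i, desc in df['description'].items():
    cleaned = text_clean(desc)
    scores = df1['description'].apply(lambda d: jaccard(cleaned, text_clean(d)))
    if scores.max() >= CUTOFF:
        df.loc[i, 'id'] = df1.loc[scores.idxmax(), 'id']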

Upvotes: 0

polm23

Reputation: 15593

Since you're looking for almost-identical strings, spaCy is not really the right tool for this. Word vectors are about meaning, but you're looking for string similarity.

Maybe this only works because of your simplified example, but you can normalize your strings by removing anything that doesn't make a difference. For example,

text = "Alice lives at..."
text = text.replace(" ", "") # remove spaces
text = text.replace("/", "") # remove slashes
text = text.replace("year", "") # remove "year"
text = text.lower()

It seems like in most (all?) of your examples that would make your strings identical. You can then match strings by using their normalized forms as keys in a dictionary, as in the sketch below.
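A minimal sketch of that dictionary lookup; the normalize helper just bundles the replacements shown above, and queries with no exact normalized match come back as None and need a fuzzy fallback:

def normalize(text):
    text = text.lower()  # lowercase first so "YEAR" is caught by the "year" rule
    for junk in (" ", "/", ".", "year"):
        text = text.replace(junk, "")
    return text

# one pass over the reference table builds the lookup...
lookup = {normalize(desc): ref_id for ref_id, desc in zip(df1['id'], df1['description'])}

# ...and each query is then a single dict access, not a pairwise scan
df['id'] = [lookup.get(normalize(desc)) for desc in df['description']]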

This approach has an important advantage over the fuzzy matching described in the prior answer. Using a string distance measure to check whether two candidates are close enough is useful, but you really don't want to compute string distance for every pair of entries across both tables. If you normalize strings as suggested here, you can find matches without comparing each string to every string in the other table.

If the normalization strategy here doesn't work, look at simhash or other locality-sensitive hashing techniques. A simplified version would be to use rare words, like the names in your example data, to create "buckets" of entries. Computing the string similarity of everything within a bucket is still somewhat slow, but much better than scanning the whole table.
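A toy version of that bucketing idea, using the first token as the "rare word" (an assumption that only fits this example data) and the stdlib's difflib as a stand-in string distance measure:

from collections import defaultdict
from difflib import SequenceMatcher

# group reference rows by a rare token; the leading name works for this toy data
buckets = defaultdict(list)
for ref_id, desc in zip(df1['id'], df1['description']):
    buckets[desc.split()[0].lower()].append((ref_id, desc))

def best_match(query):
    # only score the handful of candidates sharing the query's bucket
    candidates = buckets.get(query.split()[0].lower(), [])
    scored = [(SequenceMatcher(None, query.lower(), desc.lower()).ratio(), ref_id)
              for ref_id, desc in candidates]
    return max(scored)[1] if scored else None

df['id'] = df['description'].apply(best_match)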

Upvotes: 1

fsimonjetz

Reputation: 5802

As your strings are "almost" identical, here is a simpler suggestion using the string matching module fuzzywuzzy which, as the name suggests, performs fuzzy string matching.

It offers a number of functions to compute string similarity; you can try out different ones and pick whichever seems to work best. Given your example dataframes...

    id                           description
0  nan       STEVE has 50 OLD LIVES IN At.14
1  nan   JOHN HAS 25 YEAR OLD LIVES IN At/12
2  nan  ALICIE HAS 10 YEAR OLD LIVES IN AT15

     id                           description
0  1203   STEVEN HAS 25year OLD lives in At 6
1  1205     JOHN HAS 25year OLD LIVES IN At 2
2  1045  ALICIE HAS 50year OLD LIVES IN At 13
3  3045   STEVE HAS 50year OLD LIVES IN At 14
4  3465  ALICIE HAS 10year OLD LIVES IN At 13

...even the most basic ratio function seems to give us the correct result.

from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd

# vectorize fuzz.ratio so it broadcasts over numpy arrays
fuzzy_ratio = np.vectorize(fuzz.ratio)

# pairwise similarity matrix: rows are df descriptions, columns are df1 descriptions
dist_matrix = fuzzy_ratio(df.description.values[:, None], df1.description.values)

dist_df = pd.DataFrame(dist_matrix, df.index, df1.index)

Result:

    0   1   2   3   4
0  52  59  66  82  63
1  49  82  65  66  62
2  39  58  78  65  81

The row-wise maximum values suggest the following mappings:

  • 'STEVE has 50 OLD LIVES IN At.14', 'STEVE HAS 50year OLD LIVES IN At 14'
  • 'JOHN HAS 25 YEAR OLD LIVES IN At/12', 'JOHN HAS 25year OLD LIVES IN At 2'
  • 'ALICIE HAS 10 YEAR OLD LIVES IN AT15', 'ALICIE HAS 10year OLD LIVES IN At 13'
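To turn those row-wise maxima into id assignments programmatically, a possible follow-up using the dist_df from above:

# for each query row, pick the reference column with the highest ratio
best = dist_df.idxmax(axis=1)
df['id'] = df1['id'].iloc[best].values
print(df)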

Note, however, that it's a very close call in the last case (81 vs. 78), so this is not guaranteed to always be correct. Depending on what your data looks like, you may need more sophisticated heuristics. If all else fails, you could even try vector-based similarity metrics like Word Mover's Distance, but that seems like overkill when the strings aren't really all that different.

Upvotes: 1
