Reputation:
I have a large dataframe to compare with another dataframe and correct the id. I'm gonna illustrate my problem into this simple exp.
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
df = pd.DataFrame({'id':['nan','nan','nan'],
'description':['JOHN HAS 25 YEAR OLD LIVES IN At/12','STEVE has 50 OLD LIVES IN At.14','ALICIE HAS 10 YEAR OLD LIVES IN AT13']})
print(df)
df1 = pd.DataFrame({'id':[1203,1205,1045],
'description':['JOHN HAS 25year OLD LIVES IN At 2','STEVE has 50year OLD LIVES IN At 14','ALICIE HAS 10year OLD LIVES IN At 13']})
print(df1)
age = ["50year", "25year", "10year"]
for a in age:
ruler.add_patterns([{"label": "age", "pattern": a}])
names = ["JOHN", "STEVE", "ALICIA"]
for n in names:
ruler.add_patterns([{"label": "name", "pattern": n}])
ref = ["AT 2", "At 13", "At 14"]
for r in ref:
ruler.add_patterns([{"label": "ref", "pattern": r}])
#exp to check text difference
doc = nlp("JOHN has 25 YEAR OLD LIVES IN At.12 ")
for ent in doc.ents:
print(ent, ent.label_)
Actually there is a difference in the text of the two dataframe df and df1 which is the reference, as shown in the picture bellow
I dont know how to get similarties 100% in this case. I tried to use spacy but i dont how to fix difference and correct the id in df.
This is my dataframe1:
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
This my reference dataframe:
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
My expected output:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10year OLD LIVES IN AT15
NB:The sentences are not in the same order / The dataframes don't have equal length
Upvotes: 4
Views: 989
Reputation:
If the batch size is very large (and because using fuzzywuzzy is slow), we might be able to construct a KNN index using NMSLIB on some substring ngram embeddings (idea lifted from this article and this follow-up):
import re
import pandas as pd
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer
def ngrams(description, n=3):
description = description.lower()
description = re.sub(r'[,-./]|\sBD',r'', description)
ngrams = zip(*[description[i:] for i in range(n)])
return [''.join(ngram) for ngram in ngrams]
def build_index(descriptions, vectorizer):
ref_vectors = vectorizer.fit_transform(descriptions)
index = nmslib.init(method='hnsw',
space='cosinesimil_sparse',
data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(ref_vectors)
index.createIndex()
return index
def search_index(queries, vectorizer, index):
query_vectors = vectorizer.transform(query_df['description'])
results = index.knnQueryBatch(query_vectors, k=1)
return [res[0][0] for res in results]
# ref_df = df1, query_df = df
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
index = build_index(ref_df['description'], vectorizer)
results = search_index(query_df['description'], vectorizer, index)
query_ids = [ref_df['id'].iloc[ref_idx] for ref_idx in results]
query_df['id'] = query_ids
print(query_df)
This gives:
id description
0 3045 STEVE has 50 OLD LIVES IN At.14
1 1205 JOHN HAS 25 YEAR OLD LIVES IN At/12
2 3465 ALICIE HAS 10 YEAR OLD LIVES IN AT13
We can do more pre-processing in ngrams
, EG: stop words, handling symbols, etc.
Upvotes: 2
Reputation: 46
I think using spacy here is not the correct way. What you need to use is (1) regex (2) jaccard match. As it seems most of your tokens are supposed to exactly match, therefore Jaccard match, which calculates how many words are similar between two sentences; will be good. For the regex part, I would follow the following formatting:
import re
def text_clean(text):
#remove everything except alphabets
text = re.sub('[^A-Za-z.]', ' ', text)
text = text.lower()
return text
Now the above function, applied on all the strings, will remove the digits and all the '.', '/' etc characters. After that, if you apply Jaccard similarity, then you should get good matches.
I have suggested removing the digits, as in one of your examples, /12 turned into 2 and you still matched. So that meant that you are mainly concerned with the words and not the digits to be exact.
You may not get 100% accuracy using just the Jaccard match. Important thing is that you will not be able to get 100% Jaccard match in all matches, so you will have to put a cut-off on the value of the Jaccard match, above which you would want to consider a match.
You may want to come up with a more complex approach using both spacy and Jaccard match on the cleaned strings, and then putting custom cut off on both match scores and picking your matches.
Also, I noted that in some cases, you are getting two words occurring together. Is that only occurring with digits such as At13? or is it occurring with two words also? to use the Jaccard match efficiently, you will need to resolve that as well. But that's a whole other process and a bit out of scope for this answer.
Upvotes: 0
Reputation: 15593
Since you're looking for almost-identical strings, spaCy is not really the right tool for this. Word vectors are about meaning, but you're looking for string similarity.
Maybe this is just possible because of your simplified example, but you can normalize your strings by removing stuff that doesn't make a difference. For example,
text = "Alice lives at..."
text = text.replace(" ", "") # remove spaces
text = text.replace("/", "") # remove slashes
text = text.replace("year", "") # remove "year"
text = text.lower()
It seems like in most (all?) of your examples that would make your strings identical. You can then match strings by using their normalized forms as keys for a dictionary, for example.
This approach has an important advantage over the fuzzy matching described in the prior answer. While once you have two candidates using a string distance measure to see if they're close enough is important, you really don't want to compute string distance for every entry in both tables. If you normalize strings like I've suggested here, you can find matches without comparing each string with every string in the other table.
If the normalization strategy here doesn't work, look at simhash or other locality sensitive hashing techniques. A simplified version would be to use rare words, like the names in your example data, to create "buckets" of entries. Computing the string similarity of everything in a bucket is somewhat slow, but better than using the whole table.
Upvotes: 1
Reputation: 5802
As your strings are "almost" identical, here is a more simple suggestion using the string matching module fuzzywuzzy
which, as the name suggests, performs fuzzy string matching.
It offers a number of functions to compute string similarity, you can try out different ones and pick one that seems to work best. Given your example dataframes...
id description
0 nan STEVE has 50 OLD LIVES IN At.14
1 nan JOHN HAS 25 YEAR OLD LIVES IN At/12
2 nan ALICIE HAS 10 YEAR OLD LIVES IN AT15
id description
0 1203 STEVEN HAS 25year OLD lives in At 6
1 1205 JOHN HAS 25year OLD LIVES IN At 2
2 1045 ALICIE HAS 50year OLD LIVES IN At 13
3 3045 STEVE HAS 50year OLD LIVES IN At 14
4 3465 ALICIE HAS 10year OLD LIVES IN At 13
...even the most basic ratio
function seems to give us the correct result.
from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd
fuzzy_ratio = np.vectorize(fuzz.ratio)
dist_matrix = fuzzy_ratio(df.description.values[:, None], df1.description.values)
dist_df = pd.DataFrame(dist_matrix, df.index, df1.index)
Result:
0 1 2 3 4
0 52 59 66 82 63
1 49 82 65 66 62
2 39 58 78 65 81
The row-wise maximum values suggest the following mappings:
Note, however, that it's a very close call in the last case, so this is not guaranteed to be always correct. Depending on what your data looks like, you might need more sophisticated heuristics. If all fails, you might even give vector-based similarity metrics like word movers distance a try but it seems overkill if the strings aren't really all that different.
Upvotes: 1