Reputation: 93
I have a dataset that contains pairs of names. It looks like this:
ID; name1; name2
1; Mike Miller; Mike Miler
2; John Doe; Pete McGillen
3; Sara Johnson; Edita Johnson
4; John Lemond-Lee Peter; John LL. Peter
5; Marta Sunz; Martha Sund
6; John Peter; Johanna Petera
7; Joanna Nemzik; Joanna Niemczik
Some of the cases are labelled: I check them manually and decide whether they are duplicates. The manual judgement for these cases would be:
1: Is a duplicate
2: Is not a duplicate
3: Is not a duplicate
4: Is a duplicate
5: Is not a duplicate
6: Is not a duplicate
7: Is a duplicate
(The 7th case is a special case, because phonetics come into play here too. However, this is not the main problem; I am OK with ignoring phonetics.)
A first approach would be to calculate the Levenshtein distance for each pair and mark a pair as a duplicate when the distance is, for example, less than or equal to 2. This would lead to the following output:
1: Levenshtein distance: 1 => duplicate
2: Levenshtein distance: 11 => not a duplicate
3: Levenshtein distance: 4 => not a duplicate
4: Levenshtein distance: 8 => not a duplicate
5: Levenshtein distance: 2 => duplicate
6: Levenshtein distance: 4 => not a duplicate
7: Levenshtein distance: 2 => duplicate
This would be an approach that uses a "fixed" algorithm based on the Levenshtein distance.
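For reference, the fixed-threshold baseline described above can be sketched in a few lines of plain Python. The helper names (`levenshtein`, `is_duplicate`) and the `THRESHOLD` constant are illustrative choices, not part of any library; running the script is a quick way to double-check the distances listed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Name pairs from the question.
name_pairs = [
    ("Mike Miller", "Mike Miler"),
    ("John Doe", "Pete McGillen"),
    ("Sara Johnson", "Edita Johnson"),
    ("John Lemond-Lee Peter", "John LL. Peter"),
    ("Marta Sunz", "Martha Sund"),
    ("John Peter", "Johanna Petera"),
    ("Joanna Nemzik", "Joanna Niemczik"),
]

THRESHOLD = 2  # assumed cutoff, taken from the example in the question


def is_duplicate(name1: str, name2: str, threshold: int = THRESHOLD) -> bool:
    """Flag a pair as a duplicate when the edit distance is within the cutoff."""
    return levenshtein(name1, name2) <= threshold


for idx, (name1, name2) in enumerate(name_pairs, start=1):
    distance = levenshtein(name1, name2)
    verdict = "duplicate" if is_duplicate(name1, name2) else "not a duplicate"
    print(f"{idx}: Levenshtein distance: {distance} => {verdict}")
```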
Now, I would like to do this task using a neural network / machine learning:
I do not need the neural network to detect semantic similarity, like "hospital" and "clinic". However, I would like to avoid the Levenshtein distance, because I would like the ML algorithm to be able to flag "John Lemond-Lee Peter" and "John LL. Peter" as a potential duplicate, even if not with 100% certainty. The Levenshtein distance is relatively high in this case (8), as quite a few characters have to be added. In a case like "John Peter" and "Johanna Petera" the Levenshtein distance is smaller (4), yet this is in fact not a duplicate, and I would hope that the ML algorithm could detect that. So I need the ML algorithm to "learn the way I need the duplicates to be checked". The labels I provide as input would give the ML algorithm the direction of what I want.
I actually thought that this should be an easy task for an ML algorithm / neural network, but I am not sure.
How can I implement a neural network to compare the pairs of names and identify duplicates without using an explicit distance metric (like the Levenshtein distance, Euclidean distance, etc.)?
I thought that it would be possible to convert the strings to numbers, and that a neural network could work with these and learn to detect duplicates according to my labelling style, without my having to specify a distance metric. I thought about a human: I would give this task to a person, and this person would judge and make a decision. This person has no clue about the Levenshtein distance or any other mathematical concept. So I just want to train the neural network to do what the human is doing. Of course, every human is different, and it also depends on my labelling.
(Edit: The ML/neural network solutions I have seen so far (like this) use a metric like Levenshtein as a feature input. But as I said, I thought it should be possible to teach the neural network the "human judgement" without such a distance measure. Regarding my specific case of name pairs: what would be the benefit of an ML approach that uses the Levenshtein distance as a feature? It would just detect those pairs of names as duplicates that have a low Levenshtein distance. So I could use a simple algorithm that marks a pair as a duplicate if the Levenshtein distance between the two names is less than x. Why use ML instead, and what would be the additional benefit?)
Upvotes: 5
Views: 1909
Reputation: 11
Working solution using OpenAI embedding
from openai import OpenAI
import numpy as np


# Define the cosine similarity function
def cosine_similarity(v1, v2):
    """Compute the cosine similarity between two vectors."""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    similarity = dot_product / (norm_v1 * norm_v2)
    return similarity


# Function to get embedding from OpenAI
def get_embedding(client, text, model):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    embedding = response.data[0].embedding
    return np.array(embedding)


# List of name pairs
name_pairs = [
    ("Mike Miller", "Mike Miler"),
    ("John Doe", "Pete McGillen"),
    ("Sara Johnson", "Edita Johnson"),
    ("John Lemond-Lee Peter", "John LL. Peter"),
    ("Marta Sunz", "Martha Sund"),
    ("John Peter", "Johanna Petera"),
    ("Joanna Nemzik", "Joanna Niemczik")
]


# Main script to compute and print similarities
def main():
    # Initialize the OpenAI client; it reads the API key from the
    # OPENAI_API_KEY environment variable
    client = OpenAI()
    model = "text-embedding-3-small"

    # Iterate over each pair and calculate similarity
    for idx, (name1, name2) in enumerate(name_pairs, start=1):
        embedding_1 = get_embedding(client, name1, model)
        embedding_2 = get_embedding(client, name2, model)
        similarity = cosine_similarity(embedding_1, embedding_2)
        duplicate_status = "probably a duplicate" if similarity > 0.60 else "not a duplicate"
        print(f"Pair {idx}: {name1} <-> {name2}, Cosine similarity: {similarity:.2f}. This pair is {duplicate_status}.")


if __name__ == "__main__":
    main()
Pair 1: Mike Miller <-> Mike Miler, Cosine similarity: 0.80. This pair is probably a duplicate.
Pair 2: John Doe <-> Pete McGillen, Cosine similarity: 0.25. This pair is not a duplicate.
Pair 3: Sara Johnson <-> Edita Johnson, Cosine similarity: 0.56. This pair is not a duplicate.
Pair 4: John Lemond-Lee Peter <-> John LL. Peter, Cosine similarity: 0.65. This pair is probably a duplicate.
Pair 5: Marta Sunz <-> Martha Sund, Cosine similarity: 0.55. This pair is not a duplicate.
Pair 6: John Peter <-> Johanna Petera, Cosine similarity: 0.58. This pair is not a duplicate.
Pair 7: Joanna Nemzik <-> Joanna Niemczik, Cosine similarity: 0.69. This pair is probably a duplicate.
Upvotes: 1
Reputation: 1087
I have read your whole question carefully, but I still don't know why you want a neural network for that.
Real, sad answer
Tweak the edit distance (a more general distance than Levenshtein) by adding some weights. The idea: swapping characters that are close on the keyboard is more likely than swapping characters that are far apart, so the distance between Asa and Ada is smaller than the distance between Asa and Ala.
Case (4) you can cover with regex.
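A minimal sketch of the keyboard-weighted edit distance mentioned above, assuming a QWERTY layout. The cost values (0.5 for neighbouring keys, 1.0 otherwise) and the names `QWERTY_ROWS`, `substitution_cost` and `weighted_edit_distance` are illustrative assumptions; real weights would have to be tuned on your data.

```python
# Rough QWERTY grid; the physical stagger of the rows is ignored (assumption).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POSITION = {
    key: (row, col)
    for row, keys in enumerate(QWERTY_ROWS)
    for col, key in enumerate(keys)
}


def substitution_cost(char_a: str, char_b: str) -> float:
    """Cheaper substitution for neighbouring keys; comparison is case-insensitive."""
    a, b = char_a.lower(), char_b.lower()
    if a == b:
        return 0.0
    if a in KEY_POSITION and b in KEY_POSITION:
        (row_a, col_a), (row_b, col_b) = KEY_POSITION[a], KEY_POSITION[b]
        if abs(row_a - row_b) <= 1 and abs(col_a - col_b) <= 1:
            return 0.5  # assumed cost for a likely typo
    return 1.0


def weighted_edit_distance(a: str, b: str) -> float:
    """Levenshtein-style DP where only the substitution cost is weighted."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [float(i)]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1.0,                                # deletion
                current[j - 1] + 1.0,                             # insertion
                previous[j - 1] + substitution_cost(char_a, char_b),
            ))
        previous = current
    return previous[-1]


print(weighted_edit_distance("Asa", "Ada"))  # 0.5 ("s" and "d" are neighbours)
print(weighted_edit_distance("Asa", "Ala"))  # 1.0 ("s" and "l" are far apart)
```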
Happy answer
If you insist on going with an ML solution, here is a sketch of what I would do if forced:
Upvotes: 1
Reputation: 1850
The task you are solving is usually called fuzzy matching. There are some libraries that implement well-known algorithms that may help you, like fuzzyset, fuzzywuzzy or difflib. Consider giving some of those a try.
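As a quick illustration of the library route, here is a small sketch using only `difflib` from the standard library (fuzzyset and fuzzywuzzy are third-party packages and are not shown). The `similarity` helper and the lowercasing step are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(name1: str, name2: str) -> float:
    """Ratio in [0, 1]; 1.0 means the (lowercased) strings are identical."""
    return SequenceMatcher(None, name1.lower(), name2.lower()).ratio()


# Name pairs from the question.
name_pairs = [
    ("Mike Miller", "Mike Miler"),
    ("John Doe", "Pete McGillen"),
    ("Sara Johnson", "Edita Johnson"),
    ("John Lemond-Lee Peter", "John LL. Peter"),
    ("Marta Sunz", "Martha Sund"),
    ("John Peter", "Johanna Petera"),
    ("Joanna Nemzik", "Joanna Niemczik"),
]

for idx, (name1, name2) in enumerate(name_pairs, start=1):
    print(f"{idx}: {name1} <-> {name2}: ratio = {similarity(name1, name2):.2f}")
```

A cutoff on the ratio would still have to be chosen by hand or learned from the labelled pairs.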
If you still want a machine learning approach, note that your first requirement is a dataset of text pairs labelled as match or no match; you can then implement a binary classifier.
As a general rule, classical machine learning algorithms require less data and less parameter tuning to solve the task, but you need to provide better features to the model (which often means you spend more time in the feature engineering stage). I think your problem is simple enough to be solved with classical machine learning alone.
If you want to try neural networks, you could try Siamese networks or implement a binary classifier.
That said, make sure that your implementation considers the input text at the character level instead of the word level.
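To make the binary-classifier idea concrete, here is a toy sketch in plain Python: character-level features for each pair, fed into a small hand-written logistic regression. The four features, the hyperparameters and all function names are assumptions chosen for illustration, and seven labelled pairs are far too few for a real model.

```python
import math

# Labelled pairs from the question (1 = duplicate, 0 = not a duplicate).
LABELLED_PAIRS = [
    ("Mike Miller", "Mike Miler", 1),
    ("John Doe", "Pete McGillen", 0),
    ("Sara Johnson", "Edita Johnson", 0),
    ("John Lemond-Lee Peter", "John LL. Peter", 1),
    ("Marta Sunz", "Martha Sund", 0),
    ("John Peter", "Johanna Petera", 0),
    ("Joanna Nemzik", "Joanna Niemczik", 1),
]


def char_ngrams(text, n=2):
    """Set of character n-grams, padded so word boundaries count too."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def pair_features(name1, name2):
    """Hand-crafted, character-level features for one pair (hypothetical choice)."""
    grams1, grams2 = char_ngrams(name1), char_ngrams(name2)
    jaccard = len(grams1 & grams2) / len(grams1 | grams2)
    length_gap = abs(len(name1) - len(name2)) / max(len(name1), len(name2), 1)
    tokens1, tokens2 = name1.lower().split(), name2.lower().split()
    same_first = 1.0 if tokens1[:1] == tokens2[:1] else 0.0
    same_last = 1.0 if tokens1[-1:] == tokens2[-1:] else 0.0
    return [jaccard, length_gap, same_first, same_last]


def predict_proba(weights, bias, features):
    """Logistic model: probability that the pair is a duplicate."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, epochs=2000, learning_rate=0.5):
    """Plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


rows = [pair_features(a, b) for a, b, _ in LABELLED_PAIRS]
labels = [label for _, _, label in LABELLED_PAIRS]
weights, bias = train_logistic_regression(rows, labels)

for (a, b, label), features in zip(LABELLED_PAIRS, rows):
    probability = predict_proba(weights, bias, features)
    print(f"{a} <-> {b}: P(duplicate) = {probability:.2f} (label: {label})")
```

The point of learning the weights is that the model combines several signals according to your labels, instead of applying one fixed cutoff to a single distance.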
Upvotes: 0
Reputation: 1489
In my experience, OpenAI's GPT-3 works well for such tasks (I'm using it to analyze astrophysical texts). Describe the task in natural language and then provide a few examples for few-shot learning. Here's the quick experiment I performed in the OpenAI Playground (green text was generated by GPT-3):
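As a rough illustration of the few-shot idea (this is not the prompt used in the experiment above), a prompt could be assembled from a handful of labelled pairs like this. The wording of the instruction, the `Name 1` / `Name 2` / `Answer` layout and the `build_few_shot_prompt` helper are all assumptions; the resulting string would then be sent to whichever completion model you use.

```python
# A few labelled pairs from the question, used as in-context examples.
labelled_examples = [
    ("Mike Miller", "Mike Miler", "duplicate"),
    ("John Doe", "Pete McGillen", "not a duplicate"),
    ("John Lemond-Lee Peter", "John LL. Peter", "duplicate"),
    ("John Peter", "Johanna Petera", "not a duplicate"),
]


def build_few_shot_prompt(examples, name1, name2):
    """Task description, then worked examples, then the open query."""
    lines = [
        "Decide whether the two names refer to the same person.",
        "Reply with 'duplicate' or 'not a duplicate'.",
        "",
    ]
    for example_1, example_2, label in examples:
        lines.append(f"Name 1: {example_1}")
        lines.append(f"Name 2: {example_2}")
        lines.append(f"Answer: {label}")
        lines.append("")
    # The final block is left unanswered for the model to complete.
    lines += [f"Name 1: {name1}", f"Name 2: {name2}", "Answer:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(labelled_examples, "Joanna Nemzik", "Joanna Niemczik")
print(prompt)
```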
Upvotes: 1
Reputation: 338
A naive approach would be somewhat similar to using the Levenshtein distance. First, convert both names to vectors with a pretrained language model (I think FastText is the best choice, as it uses n-grams and is therefore more sensitive to individual characters). Then combine these two vectors (the first thing that comes to mind is to compute a metric, e.g. the Euclidean distance between them). Now you can treat the task as a classification problem: pass the calculated metric (or another function of the two vectors) and the label (duplicate / not a duplicate) to a classifier. So, in fact, you will still be computing a distance between names, but between their high-dimensional representations instead of the names themselves.
This approach probably isn't the best choice, but it can be a nice baseline for your task. Your problem is a special case of so-called similarity learning, so you can do some research and choose a specific method from this field.
You can also take a look at this paper. The authors use character-based measures to vectorize texts and then pass them to ML models.
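The vectorize, measure, classify pipeline can be sketched without any external model. In the sketch below, normalized character trigram counts stand in for FastText embeddings (a deliberate simplification, not FastText itself), and the "classifier" is just a distance cutoff picked to maximize accuracy on the labelled pairs. All names are hypothetical.

```python
import math
from collections import Counter

# Labelled pairs from the question (1 = duplicate, 0 = not a duplicate).
LABELLED_PAIRS = [
    ("Mike Miller", "Mike Miler", 1),
    ("John Doe", "Pete McGillen", 0),
    ("Sara Johnson", "Edita Johnson", 0),
    ("John Lemond-Lee Peter", "John LL. Peter", 1),
    ("Marta Sunz", "Martha Sund", 0),
    ("John Peter", "Johanna Petera", 0),
    ("Joanna Nemzik", "Joanna Niemczik", 1),
]


def char_ngram_vector(text, n=3):
    """Sparse unit-length vector of character n-gram counts (FastText stand-in)."""
    padded = f"<{text.lower()}>"
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {gram: count / norm for gram, count in counts.items()}


def euclidean_distance(v1, v2):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(v1) | set(v2)
    return math.sqrt(sum((v1.get(k, 0.0) - v2.get(k, 0.0)) ** 2 for k in keys))


def best_threshold(distances, labels):
    """Pick the cutoff (predict duplicate if distance <= cutoff) with the
    highest accuracy on the labelled data; candidates are observed distances."""
    best_cut, best_acc = 0.0, -1.0
    for cut in sorted(distances):
        correct = sum((d <= cut) == bool(y) for d, y in zip(distances, labels))
        accuracy = correct / len(labels)
        if accuracy > best_acc:
            best_cut, best_acc = cut, accuracy
    return best_cut, best_acc


distances = [
    euclidean_distance(char_ngram_vector(a), char_ngram_vector(b))
    for a, b, _ in LABELLED_PAIRS
]
labels = [label for _, _, label in LABELLED_PAIRS]
cutoff, accuracy = best_threshold(distances, labels)

for (a, b, label), distance in zip(LABELLED_PAIRS, distances):
    print(f"{a} <-> {b}: distance = {distance:.2f} (label: {label})")
print(f"learned cutoff = {cutoff:.2f}, training accuracy = {accuracy:.2f}")
```

With real embeddings, only `char_ngram_vector` would need to be swapped out, and the one-dimensional cutoff could be replaced by any classifier that takes the distance (or the two vectors) as input.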
Upvotes: 0