Reputation: 8975
Is there a function in Python that can accept multiple strings and return a percentage of how similar they are? Something like SequenceMatcher,
but for multiple strings.
For example, say we have the following sentences:
Hello how are you?
Hi how are you?
hi how are you doing?
Hey how is your day?
I want to be able to get a percentage based on how similar the sentences are to each other.
Let's say we have these three sentences:
Hello how are you?
Hello how are you?
Hello how are you?
Then we should get 100% similarity,
but if we have
Hello how are you?
Hello how are you?
hola como estats?
then we should get around 67% similarity.
Upvotes: 1
Views: 1521
Reputation: 36249
You can use numpy
to create a pairwise similarity matrix from itertools.product
. Then you can extract the desired similarity measure from that matrix. In any case, you'd need to come up with a metric (i.e. a pairwise quantifier) that suits your problem.
import itertools as it
import numpy as np

def similarity_check(sentences, metric):
    # Evaluate the metric on every ordered pair and reshape into an n x n matrix.
    pairwise = np.fromiter(map(
            metric,
            it.product(sentences, sentences)),
        dtype=float).reshape(len(sentences), -1)
    # return pairwise[np.triu_indices(len(sentences), 1)].mean()  # Option 1: mean over distinct pairs.
    return pairwise.mean(axis=0).max()  # Option 2: best column mean.

print(similarity_check([
    'Hello how are you?',
    'Hello how are you?',
    'Hello how are you?'
], lambda x: float(x[0] == x[1])))  # Plug in your own metric here.

print(similarity_check([
    'Hello how are you?',
    'Hello how are you?',
    'hola como estats?'
], lambda x: float(x[0] == x[1])))  # Plug in your own metric here.
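If you want a fuzzy percentage rather than exact equality, one possible metric to plug in (a sketch, not the only choice) is difflib.SequenceMatcher's ratio, applied to each pair produced by itertools.product:

import itertools as it
import numpy as np
from difflib import SequenceMatcher

def similarity_check(sentences, metric):
    # Same helper as above: build the full pairwise matrix,
    # then take the best column mean (Option 2).
    pairwise = np.fromiter(map(
            metric,
            it.product(sentences, sentences)),
        dtype=float).reshape(len(sentences), -1)
    return pairwise.mean(axis=0).max()

def seq_ratio(pair):
    # Unpack the (a, b) tuple from itertools.product and
    # return difflib's similarity ratio in [0, 1].
    a, b = pair
    return SequenceMatcher(a=a, b=b).ratio()

score = similarity_check([
    'Hello how are you?',
    'Hi how are you?',
    'hi how are you doing?',
    'Hey how is your day?'
], seq_ratio)
print(round(score, 2))

Identical sentences still score 1.0 under this metric, and partially overlapping ones land somewhere in between.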
Upvotes: 0
Reputation: 7410
You can use pandas
to operate on a dataframe, itertools.combinations
to generate the combinations of two strings from your list, and difflib.SequenceMatcher
for the similarity calculation:
import pandas as pd
import itertools
from difflib import SequenceMatcher

def similarity(a, b):
    seq = SequenceMatcher(a=a, b=b)
    return seq.ratio()

strings = ['Hello how are you?', 'Hi how are you?', 'hi how are you doing?', 'Hey how is your day?']
combinations = itertools.combinations(strings, 2)
df = pd.DataFrame(list(combinations))
df['similarity'] = df.apply(lambda x: similarity(x[0], x[1]), axis=1)
df.similarity.mean()
0.68
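If you'd rather avoid the pandas dependency, the same mean pairwise ratio can be computed with just the standard library (a minimal sketch of the same idea; the answer above reports this mean as 0.68):

import itertools
from statistics import mean
from difflib import SequenceMatcher

def mean_similarity(strings):
    # Average SequenceMatcher ratio over all unordered pairs.
    return mean(SequenceMatcher(a=a, b=b).ratio()
                for a, b in itertools.combinations(strings, 2))

strings = ['Hello how are you?', 'Hi how are you?',
           'hi how are you doing?', 'Hey how is your day?']
print(round(mean_similarity(strings), 2))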
Upvotes: 5
Reputation: 103764
Naively, you can do something along these lines:
from collections import Counter
from itertools import zip_longest

cases = [('Hello how are you?', 'Hello how are you?', 'Hello how are you?'),
         ('Hello how are you?', 'Hello how are you?', 'hola como estats?')]

for t in cases:
    sums = []
    for st in zip_longest(*t, fillvalue='|'):
        # Per-column difference: 0.0 if all characters agree,
        # higher the more distinct characters the column contains.
        sums.append((st, (len(Counter(st)) - 1) / len(st)))
    print(t)
    print('\n'.join(map(str, sums)))
Prints:
('Hello how are you?', 'Hello how are you?', 'Hello how are you?')
(('H', 'H', 'H'), 0.0)
(('e', 'e', 'e'), 0.0)
(('l', 'l', 'l'), 0.0)
(('l', 'l', 'l'), 0.0)
(('o', 'o', 'o'), 0.0)
((' ', ' ', ' '), 0.0)
(('h', 'h', 'h'), 0.0)
(('o', 'o', 'o'), 0.0)
(('w', 'w', 'w'), 0.0)
((' ', ' ', ' '), 0.0)
(('a', 'a', 'a'), 0.0)
(('r', 'r', 'r'), 0.0)
(('e', 'e', 'e'), 0.0)
((' ', ' ', ' '), 0.0)
(('y', 'y', 'y'), 0.0)
(('o', 'o', 'o'), 0.0)
(('u', 'u', 'u'), 0.0)
(('?', '?', '?'), 0.0)
('Hello how are you?', 'Hello how are you?', 'hola como estats?')
(('H', 'H', 'h'), 0.3333333333333333)
(('e', 'e', 'o'), 0.3333333333333333)
(('l', 'l', 'l'), 0.0)
(('l', 'l', 'a'), 0.3333333333333333)
(('o', 'o', ' '), 0.3333333333333333)
((' ', ' ', 'c'), 0.3333333333333333)
(('h', 'h', 'o'), 0.3333333333333333)
(('o', 'o', 'm'), 0.3333333333333333)
(('w', 'w', 'o'), 0.3333333333333333)
((' ', ' ', ' '), 0.0)
(('a', 'a', 'e'), 0.3333333333333333)
(('r', 'r', 's'), 0.3333333333333333)
(('e', 'e', 't'), 0.3333333333333333)
((' ', ' ', 'a'), 0.3333333333333333)
(('y', 'y', 't'), 0.3333333333333333)
(('o', 'o', 's'), 0.3333333333333333)
(('u', 'u', '?'), 0.3333333333333333)
(('?', '?', '|'), 0.3333333333333333)
So your difference in the second case will be slightly less than 1/3, since two characters are the same across all three sentences, including the final Spanish one.
Then reduce that sequence to a total difference.
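One possible reduction (an illustrative sketch, not the only choice) is to average the per-column differences and subtract from 1 to get an overall similarity:

from collections import Counter
from itertools import zip_longest

def column_similarity(sentences):
    # 1 minus the mean of the per-column difference scores above.
    diffs = [(len(Counter(st)) - 1) / len(st)
             for st in zip_longest(*sentences, fillvalue='|')]
    return 1 - sum(diffs) / len(diffs)

print(column_similarity(('Hello how are you?',) * 3))  # identical sentences -> 1.0
print(column_similarity(('Hello how are you?',
                         'Hello how are you?',
                         'hola como estats?')))

For the second case this gives roughly 0.70, reasonably close to the ~67% the question asks for.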
Upvotes: 1