Reputation: 806
I have a dataframe
Account  Message
454232   Hi, first example 1
321342   Now, second example
412295   hello, a new example 1 in the third row
432325   And now something completely different
I would like to check the similarity between the texts in the Message column. I need to choose one of the messages as the source to test against (for example, the first one) and create a new column holding the output of the similarity test. If I had two lists, I would do it as follows:
import spacy

# 'en' is the spaCy 2.x shortcut; spaCy 3.x needs a full model name such as 'en_core_web_sm'
spacyModel = spacy.load('en')

list1 = ["Hi, first example 1"]
list2 = ["Now, second example","hello, a new example 1 in the third row","And now something completely different"]

# Parse every string into a spaCy Doc
list1SpacyDocs = [spacyModel(x) for x in list1]
list2SpacyDocs = [spacyModel(x) for x in list2]

# One row per text in list2, one column per text in list1
similarityMatrix = [[x.similarity(y) for x in list1SpacyDocs] for y in list2SpacyDocs]
print(similarityMatrix)
But I do not know how to do the same in pandas, creating a new column with similarity results.
Any suggestions?
Upvotes: 1
Views: 6725
Reputation: 2222
I am not sure about spacy, but to compare one text with the other values in the column I would use .apply(), pass it the matching function, and set axis=1 so that the function is called once per row. Here is an example using SequenceMatcher from difflib (I don't have spacy available right now).
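from difflib import SequenceMatcher

test = 'Hi, first example 1'
# axis=1 passes each row to the lambda; x.Message is that row's text
df['r'] = df.apply(lambda x: SequenceMatcher(None, test, x.Message).ratio(), axis=1)
print(df)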
Result:
   Account                                  Message         r
0   454232                      Hi, first example 1  1.000000
1   321342                      Now, second example  0.578947
2   412295  hello, a new example 1 in the third row  0.413793
3   432325   And now something completely different  0.245614
So in your case it will be a similar statement, but calling your spacy similarity function instead of SequenceMatcher.
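To make the swap concrete, here is a rough sketch of the same idea with the similarity function passed in as a parameter. The helper names add_similarity_column and sequence_ratio are made up for illustration, and the executable part uses SequenceMatcher as a stand-in so it runs without a spacy model. The commented lines at the end show how the spacy version might look; the model name there is an assumption, so use whichever model you have installed.

```python
from difflib import SequenceMatcher

import pandas as pd


def add_similarity_column(df, source_text, similarity,
                          text_col="Message", out_col="similarity"):
    """Return a copy of df with similarity(source_text, message) for every row."""
    out = df.copy()
    out[out_col] = out[text_col].apply(lambda message: similarity(source_text, message))
    return out


def sequence_ratio(a, b):
    # Character-based stand-in so the sketch runs without a spaCy model installed.
    return SequenceMatcher(None, a, b).ratio()


df = pd.DataFrame({
    "Account": [454232, 321342, 412295, 432325],
    "Message": [
        "Hi, first example 1",
        "Now, second example",
        "hello, a new example 1 in the third row",
        "And now something completely different",
    ],
})

# Use the first message as the source text to compare against.
source = df["Message"].iloc[0]
result = add_similarity_column(df, source, sequence_ratio)
print(result)

# With spaCy (assumes a model with word vectors is installed;
# the model name below is an assumption, not taken from the question):
# import spacy
# nlp = spacy.load("en_core_web_md")
# source_doc = nlp(source)
# df["similarity"] = [source_doc.similarity(doc) for doc in nlp.pipe(df["Message"])]
```

Since only one column is needed, calling .apply() on df["Message"] avoids building a full row object for every call, and parsing the source text once outside the loop keeps spacy from re-processing it for each row.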
Upvotes: 2