Reputation: 341
I want to calculate cosine similarity between the rows of a dataframe column, but after converting the column to a list and passing it to nlp() I get this error: Argument 'string' has incorrect type (expected str, got list).
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")
df= [['24, Single, Consultant, Canada, I am interested in visiting Isreal again'], ['18, Single, Student, I want to go back Costa Rica again'], ['45,Married, Unemployed, I want to take my family to Florida for the summer vacation']]
df = pd.DataFrame(df, columns = ['Free Text'])
df["N_Application"]=range(0, len(df))
# convert the dataframe column to a list
data=df['Free Text'].tolist()
df_spacy=nlp(data)
I would appreciate it if someone could help me fix this. Thank you.
Upvotes: 1
Views: 535
Reputation: 169494
nlp() expects a single string, which is why passing it a whole list raises that error. To run a function on every element of a pd.Series, use .apply(). You can also chain .apply() calls.
Example:
# changing to strings instead of nested list
l = ['24, Single, Consultant, Canada, I am interested in visiting Isreal again',
     '18, Single, Student, I want to go back Costa Rica again',
     '45,Married, Unemployed, I want to take my family to Florida for the summer vacation']
df = pd.DataFrame(l, columns=['Free Text'])
# remove stop words and punctuation for later similarity calculations
df_spacy = df['Free Text'].apply(nlp)\
    .apply(lambda doc: nlp(' '.join(str(t)
                                    for t in doc
                                    if not t.is_stop
                                    and not t.is_punct)))
Edit: per your comment, here is a similarity calculation between each row and all other rows:
df_spacy.apply(lambda row: df_spacy\
    .apply(lambda doc: row.similarity(doc) if row != doc else None))
Resulting similarity matrix:
          0         1         2
0       NaN  0.776098  0.716560
1  0.776098       NaN  0.705024
2  0.716560  0.705024       NaN
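Since the question asks about cosine similarity specifically: by default, Doc.similarity is a cosine similarity over the documents' vectors. If it helps to see what that means, here is a minimal pure-Python sketch of the same pairwise calculation. The names cosine_similarity and pairwise_similarity and the toy vectors are made up for illustration; in practice you would pass each doc.vector instead, and treating an all-zero vector as similarity 0 is my own assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as similarity 0
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(vectors):
    """Similarity of every vector with every other; None on the diagonal."""
    n = len(vectors)
    return [[None if i == j else cosine_similarity(vectors[i], vectors[j])
             for j in range(n)]
            for i in range(n)]


# Toy vectors standing in for doc.vector (hypothetical values)
toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
matrix = pairwise_similarity(toy)
for row in matrix:
    print(row)
```

The result is symmetric, like the matrix above, so you only need one triangle of it if you are looking for the most similar pairs.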
Upvotes: 1