Reputation: 27
I am trying to write a loop over my dataframe in Python so that I can compare text documents using a count vectorizer method and other similar comparison functions.
I have data on movie franchises and want to compare the plot of each sequel to the original film in the franchise, as well as to the previous film in the franchise. I have attached a snippet of the data. For example, I want Seq 1 in FranID 1 to be compared to Seq 0 in FranID 1, and I want this to continue for each sequel and franchise. That is, Seq 2, 3, 4, 5, etc. should each be compared to Seq 0 within the same FranID.
In addition, I want a separate loop that compares each sequel to the previous film within each franchise. For example, I want to compare Seq 1 to Seq 0, Seq 2 to Seq 1, etc.
Is there a way to loop through the data so that I can apply this code (or something similar) and add the result to the dataframe as a new variable for each film:
def cosine_distance_countvectorizer_method(s1, s2):
    # sentences to list
    allsentences = [s1 , s2]
    # packages
    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.spatial import distance
    # text to vector
    vectorizer = CountVectorizer()
    all_sentences_to_vector = vectorizer.fit_transform(allsentences)
    text_to_vector_v1 = all_sentences_to_vector.toarray()[0].tolist()
    text_to_vector_v2 = all_sentences_to_vector.toarray()[1].tolist()
    # distance of similarity
    cosine = distance.cosine(text_to_vector_v1, text_to_vector_v2)
    print('Similarity of two sentences are equal to ',round((1-cosine)*100,2),'%')
    return cosine
I then call it like this:
cosine_distance_countvectorizer_method(ss1 , ss2)
Data example:
Upvotes: 0
Views: 129
Reputation: 1401
Based on the discussion in the chat, here you go:
ss1 corresponds to df['plot']
ss2 corresponds to df['plot_prev']
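In case it helps, here is a minimal sketch of one way a plot_prev column like the one above could be built without an explicit loop. It assumes the dataframe has the columns FranID, Seq and plot as described in the question. The sample rows and the plot_orig column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names FranID, Seq and plot are assumed
df = pd.DataFrame({
    "FranID": [1, 1, 1, 2, 2],
    "Seq": [0, 1, 2, 0, 1],
    "plot": [
        "a boy finds a lost dragon",
        "the boy and the dragon return home",
        "the dragon leaves the boy behind",
        "a thief plans a bank robbery",
        "the thief plans a casino robbery",
    ],
})

# Sort so that shift() and "first" follow the sequel order within each franchise
df = df.sort_values(["FranID", "Seq"]).reset_index(drop=True)

plots_by_franchise = df.groupby("FranID")["plot"]

# Plot of the previous film in the same franchise (missing for Seq 0)
df["plot_prev"] = plots_by_franchise.shift(1)

# Plot of the original film (Seq 0), repeated on every row of the franchise
df["plot_orig"] = plots_by_franchise.transform("first")
```

With both columns in place, the comparison to the previous film and the comparison to the original film each reduce to comparing two columns on the same row.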
To apply the function to the dataframe, pass the extra arguments through args / kwargs:
https://www.journaldev.com/33478/pandas-dataframe-apply-examples
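As a rough sketch of what that could look like: the function below is a trimmed version of the one in the question (without the print), and row_distance is a hypothetical helper that picks two text columns out of a row. The sample plots and the dist_prev column name are assumptions for illustration only.

```python
import pandas as pd
from scipy.spatial import distance
from sklearn.feature_extraction.text import CountVectorizer


def cosine_distance(s1, s2):
    """Cosine distance between the word-count vectors of two texts."""
    vectors = CountVectorizer().fit_transform([s1, s2]).toarray()
    return distance.cosine(vectors[0], vectors[1])


def row_distance(row, col_a, col_b):
    """Hypothetical helper: compare two text columns of a single row."""
    if pd.isna(row[col_a]) or pd.isna(row[col_b]):
        return float("nan")  # e.g. the first film has no previous plot
    return cosine_distance(row[col_a], row[col_b])


# Made-up data that already contains a plot_prev column
df = pd.DataFrame({
    "plot": [
        "a boy finds a dragon",
        "the boy and the dragon return",
        "the dragon leaves",
    ],
    "plot_prev": [
        None,
        "a boy finds a dragon",
        "the boy and the dragon return",
    ],
})

# axis=1 hands the function one row at a time;
# args supplies the extra positional arguments after the row
df["dist_prev"] = df.apply(row_distance, axis=1, args=("plot", "plot_prev"))
```

The same call with a different column pair in args would give the comparison against the original film, so only one helper is needed for both cases.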
Upvotes: 1