Reputation: 3282
I have a dataframe that looks like this:
question                               answer
Why did the chicken cross the road?    to get to the other side
Who are you?                           a chatbot
Hello, how are you?                    Hi
...
What I'd like to do is fit TF-IDF on this dataset. When the user enters a phrase, the question that best matches it should be chosen using cosine similarity. I am able to create the TF-IDF values for the sentences in the training dataset this way, but how do I use them to compute the cosine similarity score for a new phrase the user enters?
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(intent_data["sentence"])
Upvotes: 0
Views: 1615
Reputation: 470
Try this:
Input:
   question                               answer
0  Why did the chicken cross the road?    to get to the other side
1  Who are you?                           a chatbot
2  Hello, how are you?                    Hi
#Script
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
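# data = the input DataFrame shown above
v = TfidfVectorizer()
question_vectors = v.fit_transform(data["question"])  # fit on the known questions once
sentence_input = ["hello, you"]
# Cosine similarity between every stored question and the user's phrase
similarity_index_list = cosine_similarity(question_vectors, v.transform(sentence_input)).flatten()
# argmax() returns a position, so index with iloc rather than loc
output = data["answer"].iloc[similarity_index_list.argmax()]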
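For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The DataFrame is just the three rows from the question, and `best_answer` is a hypothetical helper name I made up for illustration, not something from scikit-learn.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data copied from the question (assumption: the real data has the same two columns)
data = pd.DataFrame({
    "question": [
        "Why did the chicken cross the road?",
        "Who are you?",
        "Hello, how are you?",
    ],
    "answer": ["to get to the other side", "a chatbot", "Hi"],
})

# Fit once on the stored questions; reuse the fitted vectorizer for every user phrase
v = TfidfVectorizer()
question_vectors = v.fit_transform(data["question"])


def best_answer(phrase):
    """Hypothetical helper: return the answer whose question is most similar to `phrase`."""
    scores = cosine_similarity(question_vectors, v.transform([phrase])).flatten()
    return data["answer"].iloc[scores.argmax()]


print(best_answer("hello, you"))
```

The important part is that the user phrase goes through `transform`, not `fit_transform`, so it is projected into the same vocabulary as the stored questions.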
Suggestion: consider a prediction-based word embedding approach (e.g. fastText, word2vec) to keep some context in the output vector. It will likely give more accurate results for ambiguous sentences.
Upvotes: 1
Reputation: 914
I think you need something like
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(x, v.transform(['user input'])).flatten()
best_match_index = cosine_similarities.argmax()
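One thing to watch out for: if the user input shares no words with any stored question, every similarity is 0 and `argmax()` quietly returns index 0. A rough sketch of guarding against that is below. The `match_question` name and the `min_score` cutoff of 0.2 are my own assumptions for illustration, so tune the threshold on your own data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for the question column from the original post
questions = [
    "Why did the chicken cross the road?",
    "Who are you?",
    "Hello, how are you?",
]

v = TfidfVectorizer()
x = v.fit_transform(questions)


def match_question(user_input, min_score=0.2):
    """Hypothetical helper: index of the best-matching question, or None if nothing is close.

    min_score is an arbitrary illustrative cutoff, not a recommended value.
    """
    cosine_similarities = cosine_similarity(x, v.transform([user_input])).flatten()
    best_match_index = int(cosine_similarities.argmax())
    if cosine_similarities[best_match_index] < min_score:
        return None  # no stored question is similar enough
    return best_match_index
```

Calling `match_question("who are you")` should pick the second question, while input made only of unseen words returns `None` instead of defaulting to the first row.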
Upvotes: 1