Reputation: 60
I am trying to build a fake news classifier and I am quite new to this field. I have a column "title_1_en" that holds the title of a fake news item and another column, "title_2_en", that holds a second title. There are 3 target labels: "agreed", "disagreed", and "unrelated", depending on whether the title in "title_2_en" agrees with, disagrees with, or is unrelated to the title in the first column.
I have tried calculating a basic cosine similarity between the two titles after converting the words of the sentences into vectors. This gives me a cosine similarity score, but it needs a lot of improvement because synonyms and semantic relationships are not considered at all.
import numpy as np

def L2(vector):
    # Euclidean (L2) norm of the vector
    return np.linalg.norm(vector)

def Cosine(fr1, fr2):
    # Cosine similarity: dot product divided by the product of the two norms
    return np.dot(fr1, fr2) / (L2(fr1) * L2(fr2))
Upvotes: 1
Views: 1258
Reputation: 6864
The most important thing here is how you convert the two sentences into vectors. There are multiple ways to do that, and the most naive way is to average the word vectors of each sentence.
spaCy's similarity, which averages word vectors, is a good place to start. From the docs:
By default, spaCy uses an average-of-vectors algorithm, using pre-trained vectors if available (e.g. the en_core_web_lg model). If not, the doc.tensor attribute is used, which is produced by the tagger, parser and entity recognizer. This is how the en_core_web_sm model provides similarities. Usually the .tensor-based similarities will be more structural, while the word vector similarities will be more topical. You can also customize the .similarity() method, to provide your own similarity function, which can be trained using supervised techniques.
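To make the averaging idea concrete, here is a minimal sketch that does not use spaCy at all. The 3-dimensional vectors in `TOY_VECTORS` are made up purely for illustration (real vectors would come from a pre-trained model), and `sentence_vector` and `cosine` are hypothetical helper names, not part of any library. It only shows why averaged word vectors can capture synonyms that a word-overlap comparison misses.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented for this example only.
# In practice these would be looked up in a pre-trained embedding model.
TOY_VECTORS = {
    "fake":    np.array([0.9, 0.1, 0.0]),
    "false":   np.array([0.8, 0.2, 0.0]),
    "news":    np.array([0.1, 0.9, 0.0]),
    "report":  np.array([0.2, 0.8, 0.1]),
    "sunny":   np.array([0.0, 0.1, 0.9]),
    "weather": np.array([0.0, 0.2, 0.8]),
}


def sentence_vector(sentence, vectors=TOY_VECTORS):
    """Average the vectors of all known words in the sentence."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        raise ValueError("no known words in sentence")
    return np.mean([vectors[w] for w in words], axis=0)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


title_1 = "fake news"
title_2 = "false report"   # shares no words with title_1, but is close in meaning
title_3 = "sunny weather"  # unrelated topic

sim_related = cosine(sentence_vector(title_1), sentence_vector(title_2))
sim_unrelated = cosine(sentence_vector(title_1), sentence_vector(title_3))

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

With these toy vectors the related pair scores much higher than the unrelated pair even though the two related titles have no words in common. Note that a similarity score alone is unlikely to separate "agreed" from "disagreed", since both are on the same topic, so it is probably best treated as one input feature for a supervised classifier.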
Upvotes: 1