Reputation: 1610
I just started learning how NLP works. What I can do right now is get the frequency of a specific word per document. What I'm trying to do is compare my four documents to find their similarities and differences, as well as display the words that are shared and the words that are unique to each document.
My documents are in .csv format, imported using pandas, and each row has its own sentiment.
Upvotes: 0
Views: 55
Reputation: 3530
To be honest, the question you're asking is very high level and difficult (maybe impossible) to answer on a forum like this. So here are some ideas that might be helpful:
You could try to use [term frequency–inverse document frequency (TFIDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to compare the vocabularies for similarities and differences. This is not a large step from your current word-frequency analysis.
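A minimal sketch of that idea using scikit-learn (the variable names and the placeholder document strings are mine, not from your code): vectorize the four documents with TF-IDF, compare them pairwise with cosine similarity, and then split the vocabulary into shared and unique words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Replace these with the text of your four documents (e.g. joined from your pandas rows)
docs = [
    "text of document one ...",
    "text of document two ...",
    "text of document three ...",
    "text of document four ...",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # shape: (4, vocabulary size)

# Pairwise cosine similarity between the four documents
print(cosine_similarity(tfidf))

# Which words appear in every document, and which appear in only one?
vocab = vectorizer.get_feature_names_out()
presence = (tfidf > 0).toarray()            # boolean: document x word
shared = [w for w, cols in zip(vocab, presence.T) if cols.all()]
unique = [w for w, cols in zip(vocab, presence.T) if cols.sum() == 1]
print("shared by all documents:", shared)
print("unique to a single document:", unique)
```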
For a more detailed analysis, it might be a good idea to substitute the words of your documents with something like WordNet's synsets. This makes it possible to compare the sentence meanings at a higher level of abstraction than the actual words themselves. For example, if each of your documents mentions "planes", "trains", and "automobiles", there is an underlying similarity (vehicle references) that a simple word comparison will not be able to detect.
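A rough sketch of that idea with NLTK's WordNet interface (you need to run `nltk.download('wordnet')` once; the example words are just the ones from the paragraph above, and I only look at each word's first noun sense for brevity):

```python
from nltk.corpus import wordnet as wn

# Take the first noun synset for each word and walk up its hypernym chain
for word in ["plane", "train", "automobile"]:
    synsets = wn.synsets(word, pos=wn.NOUN)
    if synsets:
        path = synsets[0].hypernym_paths()[0]
        print(word, "->", [s.name() for s in path])

# A shared ancestor shows the "vehicle" connection that plain word matching misses
plane = wn.synsets("plane", pos=wn.NOUN)[0]
train = wn.synsets("train", pos=wn.NOUN)[0]
print(plane.lowest_common_hypernyms(train))
```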
Upvotes: 1