Reputation: 29
I am currently working on a project related to natural language processing and text mining. I have written code to calculate the frequency of unique words in a text file:
Frequency of: trypanosomiasis --> 0.0029
Frequency of: deadly --> 0.0029
Frequency of: yellow --> 0.0029
Frequency of: humanassociated --> 0.0029
Frequency of: successful --> 0.0029
Frequency of: potential --> 0.0058
Frequency of: which --> 0.0029
Frequency of: cholera --> 0.01449
Frequency of: antimicrobial --> 0.0029
Frequency of: hostdirected --> 0.0029
Frequency of: cameroon --> 0.0029
Is there any library or method that can remove common words, adjectives, helping verbs, etc. (e.g. "which", "potential", "this", "are") from a text file, so that I can calculate the most likely occurrences of scientific terminology in the text data?
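(For context, a minimal sketch of the frequency calculation described above; whitespace tokenization and the sample text are assumptions, the asker's actual code is not shown:)

```python
from collections import Counter

# Hypothetical sketch: relative frequency of each unique word in a text.
# Whitespace tokenization is an assumption; the original code is not shown.
text = "cholera is a potential deadly disease cholera"
words = text.lower().split()

counts = Counter(words)
total = len(words)
for word, count in counts.most_common():
    print(f"Frequency of: {word} --> {count / total:.4f}")
```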
Upvotes: 0
Views: 65
Reputation: 3417
Usually in text analysis you remove stopwords: common words that hold little meaning about the text. You can remove these using nltk's stopwords (example adapted from https://pythonspot.com/en/nltk-stop-words/):
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# You may need to run nltk.download('punkt') and nltk.download('stopwords') once.
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."

stopWords = set(stopwords.words('english'))
words = word_tokenize(data)

wordsFiltered = []
for w in words:
    # nltk's stopword list is lowercase, so compare case-insensitively
    if w.lower() not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)
If there are additional words you want to remove, you can just add them to the set stopWords.
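As a sketch of that workflow, here is a hypothetical example that extends a stopword set with extra words and then computes relative frequencies of the remaining tokens. The stopword list is hard-coded so the snippet runs without nltk downloads; in practice you would start from set(stopwords.words('english')) instead:

```python
from collections import Counter

# Hard-coded stand-in for set(stopwords.words('english'))
stop_words = {"all", "and", "no", "a", "the", "makes"}
# Add your own extra words to filter out
stop_words |= {"which", "potential", "this", "are"}

text = "All work and no play makes jack a dull boy"
tokens = [t.lower() for t in text.split()]
filtered = [t for t in tokens if t not in stop_words]

# Relative frequency of each remaining word
counts = Counter(filtered)
total = len(filtered)
for word, count in counts.most_common():
    print(f"Frequency of: {word} --> {count / total:.4f}")
```

With stopwords stripped out first, the frequency table is dominated by content words rather than filler like "which" or "are".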
Upvotes: 2