Reputation: 33
I want to find the number of paragraphs, sentences, words and unique words in a docx file. I have already installed python-docx and nltk. I tried many things but nothing worked, and I'm out of ideas right now.
This, for example, gives me unique letters instead of unique words:
import docx
from nltk import FreqDist

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

letexte = getText('demo.docx')
#print(letexte)
dist = FreqDist(letexte)
vocab = dist.keys()
print(len(dist))
print(vocab)
Anyway, I'm lost.
Can you show how you'd do it with a random demo.docx of more than 4 pages? Thank you.
Upvotes: 0
Views: 74
Reputation: 116
To find the unique words in a text you can use a simple Python script: pass the result of your getText()
to it and you get a list containing only the unique items. Applying len() to that list gives the number of unique words.
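import re
...
def count_unique_words(text_string):
    # Split on "; ", ", ", "*", newline-plus-space or any whitespace; drop empty strings
    word_list = [w for w in re.split(r'; |, |\*|\n |\s', text_string) if w]
    # dict.fromkeys keeps the first occurrence of each word, in order
    return list(dict.fromkeys(word_list))
...
print(len(count_unique_words(letexte)))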
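For the other counts in the question, here is a minimal sketch that works on the string returned by getText(). The letters appear in the question's output because FreqDist(letexte) iterates over the string one character at a time, so the text has to be split into words first. Everything below is an assumption for illustration: text_stats is a hypothetical helper name, paragraphs are taken to be the non-empty lines, sentences are split naively after '.', '!' or '?', and words are lowercased before counting. nltk's sent_tokenize and word_tokenize would likely handle abbreviations and punctuation better, but they need the tokenizer data to be downloaded first.

```python
import re
from collections import Counter


def text_stats(text):
    """Rough counts for a plain-text string (hypothetical helper)."""
    # Paragraphs: non-empty lines, since getText() joins paragraphs with '\n'.
    paragraphs = [p for p in text.split('\n') if p.strip()]
    # Sentences: naive split on whitespace that follows '.', '!' or '?'.
    sentences = [s for p in paragraphs
                 for s in re.split(r'(?<=[.!?])\s+', p.strip()) if s]
    # Words: runs of letters/digits, optionally joined by apostrophes;
    # lowercased so that "The" and "the" count as the same word.
    words = re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())
    freq = Counter(words)
    return {
        'paragraphs': len(paragraphs),
        'sentences': len(sentences),
        'words': len(words),
        'unique_words': len(freq),
        'frequencies': freq,
    }


# With python-docx installed, this would be called on the document text:
# stats = text_stats(getText('demo.docx'))
# print(stats['paragraphs'], stats['sentences'],
#       stats['words'], stats['unique_words'])
```

Counter is used here instead of FreqDist only to keep the sketch free of third-party imports; passing the same words list to FreqDist should give word frequencies rather than letter frequencies.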
Upvotes: 1