Alino Ikonic Buddy

Reputation: 33

How can I extract data from a docx file?

I want to find the number of paragraphs, sentences, words and unique words in a docx file. I have already installed python-docx and nltk. I have tried many things but nothing worked, and I'm out of ideas right now.

This, for example, gives me unique letters instead of unique words:

import docx
from nltk import FreqDist

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

letexte = getText('demo.docx')
#print(letexte)

dist = FreqDist(letexte)
vocab = dist.keys()

print(len(dist))
print(vocab)

Anyway, I'm lost.

Can you show how you'd do it with a random demo.docx of more than 4 pages? Thank you.

Upvotes: 0

Views: 74

Answers (1)

Sherzod Sadriddinov

Reputation: 116

To find the unique words in a text you can use a simple Python script: pass the result of your getText() to it and you will get a list containing only the unique items. You can then get the number of unique items by applying len() to that list.

import re

...

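def count_unique_words(text_string):
    # Split on "; ", ", ", "*", newline-plus-space, or any single whitespace character
    word_list = re.split(r'; |, |\*|\n |\s', text_string)
    # dict.fromkeys drops duplicates in first-seen order; empty strings left by the split are skipped
    return list(dict.fromkeys(w for w in word_list if w))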

...
print(len(count_unique_words(letexte)))
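
For the other counts in the question (paragraphs, sentences, words), here is a minimal sketch rather than a definitive solution. The names read_paragraphs and text_stats are hypothetical helpers, not part of python-docx or nltk. The sentence and word splitting uses plain regular expressions as a simplification; nltk's sent_tokenize and word_tokenize could be swapped in for better accuracy, assuming the tokenizer data they need is downloaded. Words are lowercased before counting, so "The" and "the" count as one unique word.

```python
import re


def read_paragraphs(filename):
    """Return the non-empty paragraph texts of a .docx file (needs python-docx)."""
    import docx  # imported lazily so text_stats() also works without python-docx
    return [p.text for p in docx.Document(filename).paragraphs if p.text.strip()]


def text_stats(paragraphs):
    """Count paragraphs, sentences, words and unique words in a list of strings."""
    paragraphs = [p for p in paragraphs if p.strip()]
    text = "\n".join(paragraphs)
    # Naive sentence split (an assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Words are runs of word characters, optionally joined by inner apostrophes.
    words = re.findall(r"\w+(?:'\w+)*", text.lower())
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
        "unique_words": len(set(words)),
    }


if __name__ == "__main__":
    sample = ["The cat sat. The cat slept!", "Dogs don't sleep here."]
    print(text_stats(sample))
```

With a real file this would be called as text_stats(read_paragraphs('demo.docx')). Note that the original attempt returned letters because FreqDist was given a string, which it iterates character by character; passing it a list of words (such as the words list above) gives word frequencies instead.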

Upvotes: 1
