Craig Bing

Reputation: 319

How to create corpus from multiple docx files in Python

I have a folder containing 10 docx files. I am trying to create a corpus, which should be a list of length 10, where each element is the text of one docx document.

I have the following function to extract text from docx files:

    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader
    import glob
    from docx import Document

    def getText(filename):
        document = Document(filename)

        newparatextlist = []
        for paragraph in document.paragraphs:
            newparatextlist.append(paragraph.text.strip().encode("utf-8"))
        return newparatextlist

    path = 'path_to_folder/*.docx'
    files = glob.glob(path)

    corpus_list = []
    for f in files:
        cur_corpus = getText(f)
        corpus_list.append(cur_corpus)

    corpus_list[0]

However, with content like the following in my Word documents: http://www.actus-usa.com/sampleresume.doc https://www.myinterfase.com/sjfc/resources/resource_view.aspx?resource_id=53

the function above creates a list of lists: each element of corpus_list is itself a list of paragraph strings rather than a single text. How can I create a corpus in which each file is just one element?

TIA!

Upvotes: 2

Views: 2989

Answers (2)

pratap

Reputation: 628

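    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    corpus = PlaintextCorpusReader(ROOT_PATH, r'.*\.docx')

This should create a corpus from all the docx files present in ROOT_PATH. Note that the second argument is a regular expression rather than a glob pattern, and that PlaintextCorpusReader reads files as plain text, so the docx content needs to be available in plain-text form for the output to be readable.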
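If you go this route, the text most likely has to be pulled out of each file first, because a .docx file is a ZIP archive of XML rather than a plain-text file. Here is a minimal sketch using only the standard library. The helper name docx_to_text is my own and not part of any library, and it only looks at the main document body (no headers, footers, or footnotes), so treat it as a starting point rather than a complete extractor:

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one paragraph per line.

    Hypothetical helper for illustration; assumes the archive contains
    the usual word/document.xml part.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

You could then write each result to a .txt file and point PlaintextCorpusReader at those files with a pattern such as r'.*\.txt'.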

Upvotes: 0

Kurtis Pykes

Reputation: 351

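I ran into a similar problem, which also involved loading several docx files into a corpus. I made some slight changes to your code: getText now joins the paragraphs of a document into a single string, so each file contributes one element to the list instead of a nested list.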

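    import glob
    from docx import Document

    def getText(filename):
        # Read every paragraph and join them into one string per document
        doc = Document(filename)
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text.strip("\n"))
        return " ".join(fullText)

    PATH = "path_to_folder/*.docx"
    files = glob.glob(PATH)

    # One string per file, so len(corpus_list) == number of docx files
    corpus_list = []
    for f in files:
        cur_corpus = getText(f)
        corpus_list.append(cur_corpus)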

Hopefully this solves the problem!
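In case it is useful, here is a slightly more defensive sketch of the same idea. It skips empty paragraphs (so they do not leave double spaces in the joined text) and sorts the file names so the corpus order is reproducible. The function names join_paragraphs, read_docx_paragraphs, and build_corpus are my own, not part of python-docx, and the sketch assumes python-docx is installed when the default reader is used:

```python
import glob


def join_paragraphs(paragraphs):
    """Collapse a list of paragraph strings into one string, dropping blanks."""
    cleaned = (p.strip() for p in paragraphs)
    return " ".join(p for p in cleaned if p)


def read_docx_paragraphs(filename):
    """Return the paragraph texts of one docx file via python-docx."""
    # Imported here so the pure-Python helpers work without python-docx.
    from docx import Document
    return [para.text for para in Document(filename).paragraphs]


def build_corpus(pattern, read_paragraphs=read_docx_paragraphs):
    """Return one string per file matching the glob pattern, in sorted order."""
    return [join_paragraphs(read_paragraphs(f)) for f in sorted(glob.glob(pattern))]
```

Calling build_corpus("path_to_folder/*.docx") should then give a flat list with one string per document.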

Upvotes: 1
