Reputation: 319
I have a folder that consists of various 10 docx files. I am trying to create a corpus, which should be a list of length 10. Each element of the list should refer to the text of each docx document.
I have following function to extract text from docx files:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
import glob
from docx import *
def getText(filename):
document = Document(filename)
newparatextlist = []
for paragraph in document.paragraphs:
newparatextlist.append(paragraph.text.strip().encode("utf-8"))
return newparatextlist
path = 'pat_to_folder/*.docx'
files=glob.glob(path)
corpus_list = []
for f in files:
cur_corpus = getText(f)
corpus_list.append(cur_corpus)
corpus_list[0]
However, if I have content as follows in my word documents: http://www.actus-usa.com/sampleresume.doc https://www.myinterfase.com/sjfc/resources/resource_view.aspx?resource_id=53
the above function creates a list of list. How can I simply create a corpus out of the files?
TIA!
Upvotes: 2
Views: 2989
Reputation: 628
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader(ROOT_PATH, '*.docx')
It should create corpus from all the content of docx files present in the ROOT_PATH
Upvotes: 0
Reputation: 351
I tried this on some different method for my problem. It also consisted of loading various docx files to a corpus... I made some slight changes to your code!
def getText(filename):
doc = Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text.strip("\n"))
return " ".join(fullText)
PATH = "path_to_folder/*.docx"
files = glob.glob(PATH)
corpus_list = []
for f in files:
cur_corpus = getText(f)
corpus_list.append(cur_corpus)
hopefully this solves the problem!
Upvotes: 1