Converting Docx to pure text

Question

I am trying to convert docx files to text but keep getting an error. I am using Python 2.7.

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Traceback:

return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 764: character maps to <undefined>

mr.freeze · Accepted Answer

It looks like it can't encode \u2019, and probably not \u2018 either. These are the right and left single quotation marks, respectively. I'd encode the Unicode data to ASCII and ignore anything that can't be converted, which removes them:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Drop every character that has no ASCII equivalent (e.g. curly quotes)
        txt = para.text.encode('ascii', 'ignore')
        fullText.append(txt)
    return '\n'.join(fullText)
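
One side effect of encode('ascii', 'ignore') is that the quotes disappear entirely, so "don’t" comes out as "dont". If that matters, a possible variation is to swap the common curly quotes for their plain equivalents first and only then drop whatever is left. This is just a sketch: PUNCTUATION_MAP and to_ascii are names made up for illustration, not part of python-docx, and the mapping only covers the four quote characters listed.

```python
# -*- coding: utf-8 -*-

# Hypothetical mapping of curly quotes to plain ASCII quotes; extend as needed.
PUNCTUATION_MAP = {
    u'\u2018': u"'",   # left single quotation mark
    u'\u2019': u"'",   # right single quotation mark
    u'\u201c': u'"',   # left double quotation mark
    u'\u201d': u'"',   # right double quotation mark
}

def to_ascii(text):
    """Replace known curly quotes, then drop any remaining non-ASCII characters."""
    for fancy, plain in PUNCTUATION_MAP.items():
        text = text.replace(fancy, plain)
    # 'ignore' silently discards whatever still cannot be encoded
    return text.encode('ascii', 'ignore').decode('ascii')
```

In getText above, this would presumably take the place of the encode call, i.e. txt = to_ascii(para.text).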
