Michael Hsu

Reputation: 59

Converting Docx to pure text

I am trying to convert docx files to text but keep getting an error. I am using Python 2.7.

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Traceback:

return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 764: character maps to <undefined>

Upvotes: 2

Views: 5872

Answers (2)

thisAaronMdev

Reputation: 132

Looks like an issue with that right single quote. Can you do something like:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Document has no replace(); swap the quote in each paragraph's text
        fullText.append(para.text.replace(u"\u2019", u"'"))
    return '\n'.join(fullText)

Responding from my phone so I can't test.
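
If the file has other typographic characters besides the right single quote, the same idea can be extended with a small lookup table. This is only a sketch: `REPLACEMENTS` and `normalize_punctuation` are names I made up, and the list of characters is a guess at what a Word document typically contains, not something taken from the question.

```python
# Hypothetical mapping from common "smart" punctuation to ASCII stand-ins.
REPLACEMENTS = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
    u"\u2013": u"-",   # en dash
    u"\u2014": u"-",   # em dash
}

def normalize_punctuation(text):
    """Return text with each character in REPLACEMENTS swapped for its ASCII stand-in."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    return text
```

You would call it on each paragraph, e.g. `fullText.append(normalize_punctuation(para.text))`. Anything not in the table is left untouched, so it can still fail later if the document contains other non-ASCII characters.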

Upvotes: 0

mr.freeze

Reputation: 14060

It looks like it doesn't like \u2019 and probably \u2018 either. These are the right and left single quotes, respectively. I'd encode the unicode data to ASCII and ignore anything that can't be converted, which removes them:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        txt = para.text.encode('ascii', 'ignore')
        fullText.append(txt)
    return '\n'.join(fullText)
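
Keep in mind that 'ignore' drops the characters entirely, so an apostrophe typed as \u2019 simply disappears. If you want to keep them, one option is to leave the text as unicode and choose the encoding yourself when you write it out. The sketch below assumes the error is raised when the result is printed or written (the traceback points at an encoder, but the question doesn't show that part), and `save_text` is just a name I picked for illustration.

```python
import io

def save_text(text, path):
    """Write a unicode string to path as UTF-8, keeping non-ASCII characters."""
    # io.open accepts an encoding argument on both Python 2.7 and Python 3.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

With the original getText from the question (which returns unicode), that would be something like `save_text(getText(filename), "out.txt")`.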

Upvotes: 4
