Reputation: 59
I am trying to convert docx files to text but keep getting an error. I am using Python 2.7.
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Traceback:
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 764: character maps to <undefined>
Upvotes: 2
Views: 5872
Reputation: 132
Looks like an issue with that right single quote. Can you do something like:
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Document has no replace() method, so swap the quote in each paragraph's text
        fullText.append(para.text.replace(u"\u2019", u"'"))
    return '\n'.join(fullText)
Responding from my phone so I can't test.
Upvotes: 0
Reputation: 14060
It looks like it doesn't like \u2019, and probably not \u2018 either. These are the right and left single quotation marks. I'd encode the unicode data to ASCII and ignore anything that can't be converted, which removes them:
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        txt = para.text.encode('ascii', 'ignore')
        fullText.append(txt)
    return '\n'.join(fullText)
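One caveat: 'ignore' silently drops every non-ASCII character, so "it’s" comes out as "its". If you would rather keep the punctuation, a rough sketch is to map the common typographic characters to ASCII look-alikes first and only then drop whatever is left. The names REPLACEMENTS and to_ascii below are made up for illustration (they are not part of python-docx), and the mapping is just a guess at what a typical Word document contains:

```python
# Hypothetical helper, not part of python-docx: map common "smart"
# punctuation to ASCII look-alikes, then drop anything still non-ASCII.
REPLACEMENTS = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
    u"\u2013": u"-",   # en dash
    u"\u2014": u"-",   # em dash
}

def to_ascii(text):
    """Return text with typographic punctuation replaced by ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Anything not covered by the mapping is still removed here.
    return text.encode("ascii", "ignore").decode("ascii")
```

In getText you would then call to_ascii(para.text) instead of para.text.encode('ascii', 'ignore'). The result is a unicode string rather than a byte string, but since it only contains ASCII characters it should print on the console without the charmap error.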
Upvotes: 4