Reputation: 105
When I read a Word .docx that contains tables and text with symbols into Python using python-docx, the symbols all just get dropped. The symbols were all created with the normal Insert Symbol steps. The dialog reports the character as font Symbol, character code 179, from Symbol (decimal).
python-docx just shows it as ''. The same happens with the 'plus or minus' symbol to the left of it.
When reading the text from a paragraph (not one in a table) I use the following code:
import docx

def listText():
    test = docx.Document('Problem.docx')
    testp = test.paragraphs[0]  # The only paragraph in the test docx
    stringThatShouldHaveSymbol = testp.text
    print(stringThatShouldHaveSymbol)
    return stringThatShouldHaveSymbol
This returns only '' for a document that contains nothing but those symbols. If the paragraph has the symbol followed by 10, it returns just 10.
I also tried this XML approach, but even that returned "".
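from xml.etree.ElementTree import XML
import docx

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"  # fully qualified tag of <w:t> text nodes

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        # Parse the raw paragraph XML and keep only <w:t> text nodes
        tree = XML(xml)
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = docx.Document('Problem.docx')
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")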
How can I extract the data from these documents? Is there a way to programmatically convert these symbol codes (decimal) to Unicode (hex)?
Upvotes: 0
Views: 1708
Reputation: 8257
Try this
If you now read the docx file you should get your symbol.
Not sure why the Symbol font doesn't work. In Arial, character 179 is a superscript 3 (³).
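A likely explanation, as far as I can tell: Insert Symbol with a symbol font usually stores the character as a `<w:sym w:font="Symbol" w:char="F0B3"/>` element rather than as `<w:t>` text, and `paragraph.text` only collects the text elements. Below is a rough sketch that walks the raw paragraph XML and translates `w:sym` elements itself. The names `SYMBOL_TO_UNICODE`, `sym_to_text` and `paragraph_text` are made up for this example, the mapping table is hand-written and deliberately partial, and the sample XML is a guess at what the document contains, so check it against your own file.

```python
from xml.etree.ElementTree import fromstring

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Partial, hand-written map from Symbol-font character codes to Unicode.
# Extend it with whatever symbols your documents actually use.
SYMBOL_TO_UNICODE = {
    0xA3: "\u2264",  # less-than-or-equal
    0xB1: "\u00b1",  # plus-minus
    0xB3: "\u2265",  # greater-than-or-equal (decimal 179)
    0xB9: "\u2260",  # not-equal
}


def sym_to_text(sym):
    """Translate one <w:sym> element to a Unicode string."""
    raw = sym.get(W + "char")
    if raw is None:
        return "\ufffd"
    code = int(raw, 16)
    # Word often writes symbol-font codes in the F000-F0FF private use range.
    if 0xF000 <= code <= 0xF0FF:
        code -= 0xF000
    if sym.get(W + "font") == "Symbol" and code in SYMBOL_TO_UNICODE:
        return SYMBOL_TO_UNICODE[code]
    return "\ufffd"  # replacement character for anything not in the table


def paragraph_text(p_xml):
    """Return paragraph text, including translated <w:sym> symbols."""
    parts = []
    for node in fromstring(p_xml).iter():  # document order
        if node.tag == W + "t" and node.text:
            parts.append(node.text)
        elif node.tag == W + "sym":
            parts.append(sym_to_text(node))
    return "".join(parts)


# Assumed shape of the problem paragraph: two symbols followed by "10".
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:sym w:font="Symbol" w:char="F0B1"/></w:r>'
    '<w:r><w:sym w:font="Symbol" w:char="F0B3"/></w:r>'
    '<w:r><w:t>10</w:t></w:r>'
    '</w:p>'
)
print(paragraph_text(SAMPLE))  # ±≥10
```

With python-docx you would pass `p._p.xml` for each paragraph, for example `paragraph_text(doc.paragraphs[0]._p.xml)`; cells in a table expose their own `paragraphs` list, so the same call should work there. On the decimal versus hex part of the question: `hex(179)` gives `'0xb3'`, which is the low byte of the `F0B3` value in the XML.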
Upvotes: 0