Python-docx ignoring non-unicode Symbols like 'greater than or equal to'

Question

When I read a Word .docx file that contains tables and text with symbols such as 'greater than or equal to' into Python with python-docx, the symbols are all dropped. The symbols were all inserted with Word's normal Insert Symbol steps. The dialog says the character is from the font Symbol, character code 179, from Symbol (decimal).

python-docx just returns it as ''. The same happens with the 'plus or minus' symbol to the left of it.

When reading the text from the paragraph (not the ones in a table) I use the following code:

import docx

def listText():
    test = docx.Document('Problem.docx')
    testp = test.paragraphs[0]  # The only paragraph in the test docx
    stringThatShouldHaveSymbol = testp.text

    print(stringThatShouldHaveSymbol)

    return stringThatShouldHaveSymbol

This returns only '' for a document that contains nothing but those symbols. If the symbol is followed by 10, it returns just 10.

I also tried this XML approach, but even that returned ''.

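import docx
from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"  # fully qualified tag of a w:t text element

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # iter() replaces getiterator(), which was removed in Python 3.9
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = docx.Document('Problem.docx')
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")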

How can I extract the data from these documents? Is there a way to programmatically convert these Symbol font character codes (decimal) to Unicode code points (hex)?
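One possible direction, offered as a rough sketch and not a confirmed fix: when a character is added through Insert Symbol, Word may store it as a `w:sym` element (with `w:font` and `w:char` attributes) instead of as `w:t` text, which would explain why both `p.text` and a `w:t` scan come back empty. The helper below is hypothetical (`text_with_symbols`, `SYMBOL_TO_UNICODE` and `UNKNOWN` are names made up for illustration). It walks a paragraph's XML string, such as the `p._p.xml` value used above, and translates Symbol font codes through a small hand-written table. It assumes `w:char` is a hex value that may be offset into the F000 private-use range, and the table covers only a few codes.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Partial, hand-written map from Symbol font character codes to Unicode.
# Extend it with whatever other symbols the documents contain.
SYMBOL_TO_UNICODE = {
    0xA3: "\u2264",  # less than or equal to
    0xB1: "\u00B1",  # plus or minus (decimal 177)
    0xB3: "\u2265",  # greater than or equal to (decimal 179)
    0xB9: "\u2260",  # not equal to
}

UNKNOWN = "\uFFFD"  # placeholder for symbols missing from the table


def text_with_symbols(paragraph_xml):
    """Return paragraph text, translating Symbol-font w:sym elements."""
    parts = []
    for node in ET.fromstring(paragraph_xml).iter():
        if node.tag == W + "t" and node.text:
            parts.append(node.text)
        elif node.tag == W + "sym":
            code = int(node.get(W + "char", "0"), 16)
            # Assumption: symbol fonts are often stored at F000 + code.
            if 0xF000 <= code <= 0xF0FF:
                code -= 0xF000
            if node.get(W + "font") == "Symbol" and code in SYMBOL_TO_UNICODE:
                parts.append(SYMBOL_TO_UNICODE[code])
            else:
                parts.append(UNKNOWN)
    return "".join(parts)
```

With python-docx this could be called as `text_with_symbols(p._p.xml)` for each paragraph, including paragraphs inside table cells. Printing `p._p.xml` for the problem paragraph first would confirm whether the document really uses `w:sym` before relying on this.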
