Cazforshort

Reputation: 105

Python-docx ignoring non-unicode Symbols like 'greater than or equal to'

When I read a Word .docx that contains tables and text with symbols such as 'greater than or equal to' into Python with python-docx, the symbols all just get dropped. The symbols were all created with the normal Insert Symbol steps. The dialog says the symbol is from the font Symbol, character code 179, from "Symbol (decimal)".

[Screenshot: adding symbol]

python-docx just returns '' for it. The same happens with the 'plus or minus' symbol to the left of it.

When reading the text from a paragraph (not one inside a table), I use the following code:

import docx

def listText():
    test = docx.Document('Problem.docx')
    testp = test.paragraphs[0]  # The only paragraph in the test docx
    stringThatShouldHaveSymbol = testp.text

    print(stringThatShouldHaveSymbol)

    return stringThatShouldHaveSymbol

For a document that contains only those symbols, this returns ''. If the paragraph contains the symbol followed by 10, it returns just 10.

I also tried this XML approach, but even that returned "".

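import docx
from xml.etree.ElementTree import XML

WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"  # qualified tag name of <w:t> text elements

def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # iter() replaces getiterator(), which was removed in Python 3.9
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = docx.Document('Problem.docx')
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")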

How can I extract the data from these documents? Is there a way to programmatically convert these Symbol-font character codes (decimal) to Unicode code points (hex)?

Upvotes: 0

Views: 1708

Answers (1)

cup

Reputation: 8257

Try this:

  1. In the Insert Symbol dialog, open the font drop-down and select (normal text)
  2. Now select your special symbol

If you now read the docx file, you should get your symbol.

Not sure why the Symbol font doesn't work. In Arial, character 179 is a superscript 3.
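
If re-inserting every symbol by hand is not practical, a programmatic route may be possible. One likely explanation for the empty strings is that Word stores characters inserted from the Symbol font as w:sym elements (with w:font and w:char attributes) rather than as ordinary w:t text, and the paragraph text only collects the latter. The sketch below works under that assumption and has not been run against the original Problem.docx. The function name text_with_symbols and the small SYMBOL_TO_UNICODE table are made up for illustration; the table covers only a handful of Symbol-font codes and would need extending for other characters.

```python
from xml.etree.ElementTree import XML

# WordprocessingML namespace in the {uri} form that ElementTree uses for tags
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical, deliberately partial map: Symbol-font code -> Unicode character
SYMBOL_TO_UNICODE = {
    0xA3: "\u2264",  # 163: less than or equal to
    0xB1: "\u00B1",  # 177: plus or minus
    0xB3: "\u2265",  # 179: greater than or equal to
    0xB9: "\u2260",  # 185: not equal to
}

def text_with_symbols(paragraph_xml):
    """Return paragraph text, translating <w:sym> elements where possible."""
    parts = []
    for node in XML(paragraph_xml).iter():
        if node.tag == W + "t" and node.text:
            parts.append(node.text)
        elif node.tag == W + "sym":
            # w:char is a hex string such as "F0B3"
            code = int(node.get(W + "char", "0"), 16)
            # Assumption: codes at or above 0xF000 are the font's own code
            # shifted into the private use area, so shift them back.
            if code >= 0xF000:
                code -= 0xF000
            if node.get(W + "font") == "Symbol" and code in SYMBOL_TO_UNICODE:
                parts.append(SYMBOL_TO_UNICODE[code])
            else:
                parts.append("?")  # placeholder for anything not in the table
    return "".join(parts)
```

With python-docx this could be called as text_with_symbols(p._p.xml) for each p in doc.paragraphs, using the same p._p.xml access as the question's second snippet. Printing p._p.xml for the problem paragraph first is a quick way to check whether the symbols really appear as w:sym elements in that file.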

Upvotes: 0
