Saar Drimer

Reputation: 1191

Python lxml, unicode matching in SVG attribute

I am trying to find an XML element within an SVG (font) file based on the content of an attribute, like so:

font = et.ElementTree(file='fontfile.svg')
glyph = font.find('//n:glyph[@unicode="%s"]' % symbol, namespaces={'n': SVGNS})

Glyph examples -- what I'm trying to match to -- are:

<glyph unicode="&#xa9;" horiz-adv-x="1792" d="M834 ... -40t-121 -18z " />
<glyph unicode="C" horiz-adv-x="1509" d="M1766 338q-49 ... 83.5v-215z" />

The problem is that when, for example,

symbol = "C"

it works fine (there is a match), but when

symbol = "&#xa9;"

it doesn't. I suspect that there is a unicode interpretation in one direction of the matching, but not the other. What is the correct way to resolve this?

Upvotes: 1

Views: 268

Answers (2)

Fredrick Brennan

Reputation: 7357

Building on unutbu's answer: when you parse the file with ET.fromstring, the parser decodes XML character references such as &#xa9; into the characters they stand for, so the attribute values are stored as unicode objects.

>>> import lxml.etree as ET
>>> 
>>> content = '''\
... <root xmlns="SVGNS">
... <glyph unicode="&#xa9;" horiz-adv-x="1792" d="M834 ... -40t-121 -18z " />
... <glyph unicode="C" horiz-adv-x="1509" d="M1766 338q-49 ... 83.5v-215z" />
... </root>'''
>>> font = ET.fromstring(content)
>>> font
<Element {SVGNS}root at 0x7fd7ab978410>
>>> font.xpath('//n:glyph', namespaces={'n':'SVGNS'})[0].attrib
{'horiz-adv-x': '1792', 'unicode': u'\xa9', 'd': 'M834 ... -40t-121 -18z '}

So, at the end of the day, the character reference &#xa9; no longer exists as such in font once the document has been parsed: the attribute holds the single character u'\xa9'. To search for it, the symbol needs to be converted into unicode first. Some ways to do that are explained here.
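As a rough sketch of that conversion, assuming Python 3 (where html.unescape is available in the standard library): decode the reference text into the character it names, then search with the decoded value. The find_glyph helper, the trimmed glyph data and the SVGNS value (taken here to be the standard SVG namespace URI) are assumptions for illustration, not part of the question.

```python
import html

try:
    import lxml.etree as ET
except ImportError:
    # The standard library's ElementTree accepts the same find() syntax used here.
    import xml.etree.ElementTree as ET

# Assumption: SVGNS is the standard SVG namespace URI.
SVGNS = "http://www.w3.org/2000/svg"

# Trimmed stand-in for the font file; the path data is omitted.
content = (
    '<svg xmlns="%s">'
    '<glyph unicode="&#xa9;" horiz-adv-x="1792" />'
    '<glyph unicode="C" horiz-adv-x="1509" />'
    '</svg>' % SVGNS
)
font = ET.fromstring(content)


def find_glyph(font, symbol):
    """Hypothetical helper: accepts 'C' or a reference such as '&#xa9;'."""
    # html.unescape turns '&#xa9;' into the one-character string '\xa9'
    # and leaves plain text such as 'C' unchanged.
    char = html.unescape(symbol)
    return font.find('.//n:glyph[@unicode="%s"]' % char, namespaces={'n': SVGNS})


copyright_glyph = find_glyph(font, "&#xa9;")
letter_glyph = find_glyph(font, "C")
print(copyright_glyph.get("horiz-adv-x"))  # 1792
print(letter_glyph.get("horiz-adv-x"))     # 1509
```

This only moves the decoding step to the caller's side; the comparison itself still happens between two already-decoded strings.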

Upvotes: 1

unutbu

Reputation: 879201

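You could specify the symbol as a unicode string: symbol = u'\xa9'. The parser has already decoded &#xa9; into that character, so the XPath comparison has to use the character itself.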

import lxml.etree as ET

content = '''\
<root xmlns="SVGNS">
<glyph unicode="&#xa9;" horiz-adv-x="1792" d="M834 ... -40t-121 -18z " />
<glyph unicode="C" horiz-adv-x="1509" d="M1766 338q-49 ... 83.5v-215z" />
</root>'''

font = ET.fromstring(content)
symbol = u'\xa9'
for glyph in font.xpath(u'//n:glyph[@unicode="%s"]' % symbol, namespaces={'n': 'SVGNS'}):
    print(ET.tostring(glyph))

yields

<glyph xmlns="SVGNS" unicode="&#xA9;" horiz-adv-x="1792" d="M834 ... -40t-121 -18z "/>
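One caveat with pasting the symbol into the XPath string: it breaks if the symbol is itself a double quote, which a font may well contain. A possible alternative, sketched here under the assumption that lxml is being used, is to pass the value as an XPath variable through a keyword argument to xpath(). The find_glyphs helper, the extra quote glyph with its advance width, and the SVGNS value are made up for the example.

```python
import lxml.etree as ET

# Assumption: SVGNS is the standard SVG namespace URI.
SVGNS = "http://www.w3.org/2000/svg"

# Trimmed stand-in for the font file; the &quot; glyph is hypothetical.
content = (
    '<svg xmlns="%s">'
    '<glyph unicode="&#xa9;" horiz-adv-x="1792" />'
    '<glyph unicode="C" horiz-adv-x="1509" />'
    '<glyph unicode="&quot;" horiz-adv-x="836" />'
    '</svg>' % SVGNS
)
font = ET.fromstring(content)


def find_glyphs(font, symbol):
    """Hypothetical helper: match glyphs whose unicode attribute equals symbol."""
    # $sym is bound to the keyword argument, so the value never has to be
    # quoted or escaped inside the expression text.
    return font.xpath('//n:glyph[@unicode = $sym]', sym=symbol,
                      namespaces={'n': SVGNS})


for symbol in (u'\xa9', u'C', u'"'):
    for glyph in find_glyphs(font, symbol):
        print(glyph.get('horiz-adv-x'))
```

The symbol still has to be the decoded character (u'\xa9', not the text &#xa9;); the variable only removes the quoting problem.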

Upvotes: 2
