toolater

Reputation: 23

XPath: attribute name contains a special character

<img width="410" height="410" #src="http://XXXX1.png" src="http://xxxx2.png" alt=""/>

I want to extract the image address http://XXXX1.png.
I'm using /img/@#src but get nothing. The attribute name contains a # character.

Any suggestions? Thank you for your help.

Upvotes: 2

Views: 867

Answers (2)

BlackJack

Reputation: 4679

Because #src is not a valid XML name, you can't use XPath to query that attribute: the name is itself a syntax error in an XPath expression.

As it is not even a valid HTML attribute name, you will need a lenient HTML parser that neither chokes on that attribute nor drops it from the results. The combination of BeautifulSoup with html5lib for parsing seems to work. The HTML parser in the Python standard library chokes on that attribute, and lxml.html silently ignores it.

In [33]: import bs4

In [34]: source
Out[34]: '<img width="410" height="410" #src="http://XXXX1.png" src="http://xxxx2.png" alt=""/>'

In [35]: doc = bs4.BeautifulSoup(source, 'html5lib')

In [36]: doc.img.attrs
Out[36]: 
{u'#src': u'http://XXXX1.png',
 u'alt': '',
 u'height': u'410',
 u'src': u'http://xxxx2.png',
 u'width': u'410'}

In [37]: doc.img.attrs['#src']
Out[37]: u'http://XXXX1.png'

Upvotes: 0

Abel

Reputation: 57159

<img width="410" height="410" #src="http://XXXX1.png" src="http://xxxx2.png" alt=""/>

Unfortunately, you cannot do this with XPath, as this fragment is not well-formed XML. An XML Name cannot start with, or contain, the hash symbol. XPath can only deal with an XML tree, and you cannot create such a tree from this fragment (any conforming XML parser will reject it).
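To illustrate the point about parsers rejecting the fragment, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The choice of parser and the variable names are assumptions made for the example; other XML parsers should fail in a similar way, though with different messages.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, unchanged.
source = ('<img width="410" height="410" #src="http://XXXX1.png" '
          'src="http://xxxx2.png" alt=""/>')

try:
    ET.fromstring(source)
    parsed = True
except ET.ParseError as exc:
    # The parser stops at the "#", which cannot begin an attribute name.
    parsed = False
    print("not well-formed:", exc)
```

No tree is built, so there is nothing for an XPath expression to run against.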

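To fix this, pre-process your not-really-XML and make it correct by removing or replacing that symbol. In this particular fragment, rename the attribute rather than just dropping the #, because a second src attribute on the same element would also be invalid. Or fix it at the source, if you have access to it, by not generating invalid names to begin with.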
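The pre-processing step could be sketched roughly as follows, again in Python with only the standard library. The replacement name hash-src is hypothetical, picked for this example, and the regular expression is a naive assumption about the input: it would also rewrite text that merely looks like an attribute inside an attribute value, so treat it as a starting point rather than a robust fix.

```python
import re
import xml.etree.ElementTree as ET

source = ('<img width="410" height="410" #src="http://XXXX1.png" '
          'src="http://xxxx2.png" alt=""/>')

# Rename "#name=" to "hash-name=" (hypothetical prefix). Deleting the "#"
# outright would leave two "src" attributes, which is not well-formed either.
fixed = re.sub(r'(\s)#(?=[A-Za-z_][\w.-]*\s*=)', r'\1hash-', source)

img = ET.fromstring(fixed)

# With a full XPath engine this would be /img/@hash-src; ElementTree only
# supports a subset of XPath, so the attribute is read directly instead.
address = img.get('hash-src')
print(address)
```

After the rename, any XPath-capable library can address the attribute by its new name.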

Note: there is no mechanism in XML (or HTML, for that matter) to use some kind of escape sequence in a name. Entity references may only be used in attribute values and text nodes.

Upvotes: 2
