Panagiotis Koursaris

Reputation: 4023

XPath - How to get image source from xml

Hello, I have this XML:

    <item>
        <title> Something for title»</title>
        <link>some url</link>
        <description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
        <pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
    </item>

I tried to get the img src with the path //description//div[@class='feed-description']//div[@class='feed-image']//img/@src, but it doesn't work.

Is there a solution?

Upvotes: 1

Views: 1333

Answers (1)

LarsH

Reputation: 27994

A CDATA section escapes its contents. In other words, CDATA prevents its contents from being parsed as markup when the rest of the document is parsed. So the <div>s in there are not seen as XML elements, only as flat text. The <description> element has no element children ... only a single text child. As such, XPath can't select any <div> descendant of <description> because none exists in the parsed XML tree.
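To see this for yourself, here is a minimal sketch in Python (my choice of language, the question doesn't name one) using the standard library's xml.etree.ElementTree. The ITEM_XML string is a trimmed copy of the item from the question; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <item> from the question (assumed sample data).
ITEM_XML = """<item>
    <link>some url</link>
    <description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
</item>"""

item = ET.fromstring(ITEM_XML)
description = item.find("description")

# len() on an Element counts its element children.
child_count = len(description)

# Searching for a <div> descendant finds nothing in the parsed tree.
has_div = description.find(".//div") is not None

# The markup inside the CDATA section arrives as one plain string.
text = description.text

print(child_count)
print(has_div)
print(text)
```

The <description> element reports no element children, the search for a <div> comes back empty, and the whole `<div class="feed-description">...` markup shows up as ordinary text.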

What to do?

If your XPath environment supports XPath 3.0, you could use parse-xml() to turn the flat text into a tree, then use XPath to select //div[@class='feed-description']//div[@class='feed-image']//img/@src from the resulting tree.
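If you don't have XPath 3.0 but can do the work in a host language, the same two-step idea can be sketched there. The following is only an illustration in Python with xml.etree.ElementTree, standing in for parse-xml(); it assumes the embedded markup is well-formed XML (as it is in the question's sample), which real-world HTML often is not. ElementTree supports only a subset of XPath, so the path is slightly simplified.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <item> from the question (assumed sample data).
ITEM_XML = """<item>
    <link>some url</link>
    <description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
</item>"""

# Step 1: parse the outer document and pull out the flat text.
item = ET.fromstring(ITEM_XML)
fragment_text = item.findtext("description")

# Step 2: parse that text as its own XML document.
# This raises ET.ParseError if the fragment is not well-formed.
fragment = ET.fromstring(fragment_text)

# Step 3: query the second tree. Its root is the feed-description <div>.
img = fragment.find(".//div[@class='feed-image']/img")
src = img.get("src") if img is not None else None

print(src)
```

If the feeds you handle contain sloppy HTML rather than well-formed XML, the second parse will fail and you would need an HTML parser for that step instead.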

Otherwise, your best workaround may be to use primitive string-processing functions like substring-before(), substring-after(), matches(), or replace(). (The latter two use regular expressions and require XPath 2.0.) Of course, many people will tell you not to use regular expressions to analyze markup like XML and HTML, and for good reason: in the general case, it's very difficult to do it right (with regexes or with plain string searches). But for very restricted cases where the input is highly predictable, and in the absence of better tools, it can be the best tool for a less-than-ideal job.

For example, for the data shown in your question, you could use

substring-before(substring-after(//description, 'img src="'), '"')

In this case, the inner call substring-after(//description, 'img src="') returns pictureUrl.jpg" /></div>text for desc</div>, of which the substring before " is pictureUrl.jpg.
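The two steps can be traced outside XPath as well. Here is a rough Python sketch; substring_after and substring_before are hypothetical helpers I wrote to mimic the XPath functions (returning an empty string when the marker is missing), and they assume a non-empty marker.

```python
def substring_after(text, marker):
    """Mimic XPath substring-after() for a non-empty marker."""
    _, sep, tail = text.partition(marker)
    return tail if sep else ""


def substring_before(text, marker):
    """Mimic XPath substring-before() for a non-empty marker."""
    head, sep, _ = text.partition(marker)
    return head if sep else ""


# String value of <description> from the question (assumed sample data).
description_text = (
    '<div class="feed-description"><div class="feed-image">'
    '<img src="pictureUrl.jpg" /></div>text for desc</div>'
)

# Inner call: everything after the first 'img src="'.
inner = substring_after(description_text, 'img src="')

# Outer call: everything before the next double quote.
src = substring_before(inner, '"')

print(inner)
print(src)
```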

This isn't really robust. For example, it'll fail if there's a space between src and =. But if the exact formatting is predictable, you'll be OK.
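If you end up doing the extraction in a host language and want to tolerate that kind of whitespace variation, a regular expression can be a little more forgiving. This is only a sketch in Python; the pattern and the first_img_src name are my own, and it still assumes double-quoted attribute values and shares all the usual caveats about regexes and markup.

```python
import re

# Match the src attribute of the first <img> tag, allowing other attributes
# before it and optional whitespace around '='. Double quotes only.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*"([^"]*)"')


def first_img_src(text):
    """Return the src of the first <img> in text, or None if there is none."""
    match = IMG_SRC.search(text)
    return match.group(1) if match else None


exact = first_img_src('<div class="feed-image"><img src="pictureUrl.jpg" /></div>')
spaced = first_img_src('<img src = "pictureUrl.jpg" />')
missing = first_img_src("text for desc")

print(exact, spaced, missing)
```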

Upvotes: 1
