Extract img src from CDATA text in XML

Question

I would like to extract the img src value from an XML file.

Test input:


   
      
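<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[
    lorem ipsum

    some text

    <img src="https://example.com/hello.jpg" />
]]></DESCRIPTION>
   </ITEM>
</ROOT>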

What would be the best way to do it? With XSLT or an XML parser, like xmllint?

Currently I am trying with xmllint:

xmllint --xpath '//ROOT/ITEM/DESCRIPTION/text()' input.xml | egrep -o 'src=".*(\.png|\.jpg)'

...but the output looks like this:

src="https://example.com/hello.jpg

Sure, I can remove the leading src=" with a tool like sed, but maybe there is a better and cleaner way to extract the links?

Martin Honnen · Accepted Answer

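The CDATA section is only text to the XML parser, so you need to dig deeper with XPath 3 or XSLT 3: use parse-xml-fragment to parse that text into nodes, then select the src attribute of the img element.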

Saxon 9.9 HE is available in .NET, Java and C/C++/Python versions to run/use XSLT 3.
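If no XSLT 3 processor is at hand, the same two-step idea (read the CDATA as text, then parse that text as a fragment) can be sketched with Python's standard library. This is not the XSLT from the answer, only an illustration of the technique. The sample document, the wrapper element and the function name extract_img_srcs are assumptions made for the example, and the sketch only works when the CDATA content is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question's ROOT/ITEM/DESCRIPTION layout.
SAMPLE = """<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[
    <p>lorem ipsum</p>
    <p>some text</p>
    <img src="https://example.com/hello.jpg" />
]]></DESCRIPTION>
   </ITEM>
</ROOT>"""


def extract_img_srcs(xml_text):
    """Return the src of every img found inside each DESCRIPTION's CDATA text."""
    root = ET.fromstring(xml_text)
    srcs = []
    for description in root.iterfind("./ITEM/DESCRIPTION"):
        # The parser hands the CDATA section back as plain text, so it has to
        # be parsed a second time. Wrapping it in a dummy element lets a
        # fragment with several top-level nodes parse as one document.
        fragment = ET.fromstring("<wrapper>" + (description.text or "") + "</wrapper>")
        for img in fragment.iter("img"):
            src = img.get("src")
            if src is not None:
                srcs.append(src)
    return srcs


print(extract_img_srcs(SAMPLE))
```

The second ET.fromstring call plays the role of parse-xml-fragment here. It raises ParseError if the embedded markup is not well-formed, for example on an unclosed img tag or an HTML entity such as &nbsp;.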

If the CDATA contains HTML that is not well-formed X(HT)ML then you could use the HTML parser implemented by David Carlisle in XSLT 2 (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl):

https://xsltfiddle.liberty-development.net/3NSSEv7/1
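For comparison, here is a minimal Python sketch of the same fallback for markup that is not well-formed. It swaps David Carlisle's XSLT parser for the lenient html.parser module from the standard library, so it illustrates the approach but not that stylesheet. The sample input and the names ImgSrcCollector and extract_img_srcs_lenient are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: the CDATA holds HTML that is not well-formed XML
# (unclosed p, br and img elements, upper-case names).
MESSY_SAMPLE = """<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[
    <p>lorem ipsum
    <br>
    <IMG SRC="https://example.com/hello.jpg">
]]></DESCRIPTION>
   </ITEM>
</ROOT>"""


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img start tag it sees."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.srcs.append(value)


def extract_img_srcs_lenient(xml_text):
    """Read each DESCRIPTION's CDATA as text and scan it as tag soup."""
    root = ET.fromstring(xml_text)
    collector = ImgSrcCollector()
    for description in root.iterfind("./ITEM/DESCRIPTION"):
        collector.feed(description.text or "")
    collector.close()
    return collector.srcs


print(extract_img_srcs_lenient(MESSY_SAMPLE))
```

The outer document still has to be well-formed XML. Only the text inside the CDATA section is handled leniently.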
