Extract img src from CDATA text in XML

I would like to extract the img src value from a CDATA section in an XML file.

Test input:

<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[<p align="left" dir="ltr">
    <span lang="EN">lorem ipsum</span></p>
<p>
    some text</p>
<p>
    <img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
    </ITEM>
</ROOT>         

What would be the best way to do it? With XSLT or an XML parser, like xmllint?

Currently I am trying with xmllint:

xmllint --xpath '//ROOT/ITEM/DESCRIPTION/text()' input.xml | egrep -o 'src=".*(\.png|\.jpg)'

...but the output is:

src="https://example.com/hello.jpg

Sure, I can strip the leading src=" with a tool like sed, but maybe there is a better and cleaner way to extract the links?

Upvotes: 0

Views: 667

Answers (2)

Martin Honnen

Reputation: 167716

You can do this with XPath 3 or XSLT 3 by using parse-xml-fragment to parse the CDATA text as XML:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">    

  <xsl:output method="text"/>

  <xsl:template match="/">
     <xsl:value-of select="ROOT/ITEM/DESCRIPTION/parse-xml-fragment(.)//img/@src"/>
  </xsl:template>

</xsl:stylesheet>

https://xsltfiddle.liberty-development.net/3NSSEv7

Saxon 9.9 HE, which can run XSLT 3, is available in .NET, Java and C/C++/Python versions.

If the CDATA contains HTML that is not well-formed X(HT)ML, you could instead use the HTML parser that David Carlisle implemented in XSLT 2 (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:html-parser="data:,dpc"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:output method="text"/>

  <xsl:template match="/">
     <xsl:value-of select="ROOT/ITEM/DESCRIPTION/html-parser:htmlparse(., '', true())//img/@src"/>
  </xsl:template>

</xsl:stylesheet>

https://xsltfiddle.liberty-development.net/3NSSEv7/1
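If XSLT is not available, the same idea (run a lenient HTML parser over the CDATA text) can be sketched with the Python standard library. This is only an illustration, not the htmlparse.xsl approach itself: the names ImgSrcCollector, img_sources_lenient and SAMPLE are hypothetical, and the sample input is a deliberately non-well-formed variant of the question's data.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample: the CDATA holds HTML that is NOT well-formed XML
# (unclosed <br> and <img>), so an XML parser would reject it.
SAMPLE = """<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[<p>some text<br>
    <img alt="" src="https://example.com/hello.jpg"></p>
]]></DESCRIPTION>
   </ITEM>
</ROOT>"""


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> tags, because the default
        # handle_startendtag delegates to handle_starttag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def img_sources_lenient(xml_text):
    """Return all img src values found in DESCRIPTION elements."""
    root = ET.fromstring(xml_text)
    sources = []
    for description in root.iter("DESCRIPTION"):
        # ElementTree exposes the CDATA content as ordinary element text.
        collector = ImgSrcCollector()
        collector.feed(description.text or "")
        collector.close()
        sources.extend(collector.sources)
    return sources


print(img_sources_lenient(SAMPLE))
```

A fresh collector is used per DESCRIPTION so that broken markup in one item cannot leak into the parsing of the next.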

Upvotes: 1

Daniel Haley

Reputation: 52888

If the content of your CDATA section would be well-formed XML outside of a CDATA section, you could pipe the output of xmllint into a second xmllint call so that the content is parsed as XML.

In your specific example, you have to wrap the output in another element (x in the example) to make it well-formed, because the CDATA text contains more than one top-level element.

Example...

xmllint --xpath 'concat("<x>",string(//ROOT/ITEM/DESCRIPTION),"</x>")' input.xml | xmllint --xpath 'string(//img/@src)' -

Output...

https://example.com/hello.jpg
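The same two-step parse (read the CDATA as text, wrap it in a dummy element, parse it again) can also be sketched in Python with only the standard library, in case a second xmllint call is inconvenient. The names img_sources and SAMPLE are hypothetical, and like the pipeline above this assumes the CDATA content is well-formed once wrapped.

```python
import xml.etree.ElementTree as ET

# The test input from the question.
SAMPLE = """<ROOT>
   <ITEM>
      <DESCRIPTION><![CDATA[<p align="left" dir="ltr">
    <span lang="EN">lorem ipsum</span></p>
<p>
    some text</p>
<p>
    <img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
    </ITEM>
</ROOT>"""


def img_sources(xml_text):
    """Return the src of every img inside the CDATA of DESCRIPTION elements."""
    root = ET.fromstring(xml_text)
    sources = []
    for description in root.iter("DESCRIPTION"):
        # First pass: the CDATA content arrives as plain element text.
        # Second pass: wrap it in a dummy <x> element so that several
        # top-level <p> elements become one well-formed document.
        fragment = ET.fromstring("<x>" + (description.text or "") + "</x>")
        for img in fragment.iter("img"):
            src = img.get("src")
            if src:
                sources.append(src)
    return sources


print(img_sources(SAMPLE))  # ['https://example.com/hello.jpg']
```

Unlike string(//img/@src), which returns only the first match, this collects every img src in the document. If the CDATA text is not well-formed, ET.fromstring raises xml.etree.ElementTree.ParseError.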

Upvotes: 1
