Reputation: 149
I would like to extract the img src value from an XML file.
Test input:
<ROOT>
<ITEM>
<DESCRIPTION><![CDATA[<p align="left" dir="ltr">
<span lang="EN">lorem ipsum</span></p>
<p>
some text</p>
<p>
<img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
</ITEM>
</ROOT>
What would be the best way to do it? With XSLT or an XML parser, like xmllint?
Currently I am trying with xmllint:
xmllint --xpath '//ROOT/ITEM/DESCRIPTION/text()' input.xml | egrep -o 'src=".*(\.png|\.jpg)'
...but output is like:
src="https://example.com/hello.jpg
Sure, I can remove the leading src=" with a tool like sed, but is there a better and cleaner way to extract the links?
Upvotes: 0
Views: 667
Reputation: 167716
You need to dig deeper with XPath 3 or XSLT 3, using the parse-xml-fragment function to parse the CDATA content:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
version="3.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="ROOT/ITEM/DESCRIPTION/parse-xml-fragment(.)//img/@src"/>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NSSEv7
Saxon 9.9 HE supports XSLT 3 and is available for .NET, Java, and C/C++/Python.
If the CDATA contains HTML that is not well-formed X(HT)ML then you could use the HTML parser implemented by David Carlisle in XSLT 2 (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:html-parser="data:,dpc"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="ROOT/ITEM/DESCRIPTION/html-parser:htmlparse(., '', true())//img/@src"/>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NSSEv7/1
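If XSLT is not an option, the same lenient idea can be sketched in Python with only the standard library: parse the outer XML, then feed each DESCRIPTION text to a forgiving HTML parser. This is a minimal sketch, not a port of htmlparse.xsl; the names ImgSrcCollector and img_srcs_lenient and the deliberately sloppy sample markup (unclosed br and img) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: the CDATA holds HTML that is NOT well-formed XML
# (unclosed <br> and <img>), so an XML parser would reject it.
SAMPLE = """<ROOT>
<ITEM>
<DESCRIPTION><![CDATA[<p>some text<br>
<img alt="" src="https://example.com/hello.jpg"></p>
]]></DESCRIPTION>
</ITEM>
</ROOT>"""


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img start tag it sees."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; self-closing tags
        # (<img ... />) are routed here by the default implementation.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def img_srcs_lenient(xml_text):
    """Return all img src values found in DESCRIPTION elements."""
    root = ET.fromstring(xml_text)
    srcs = []
    for desc in root.iter("DESCRIPTION"):
        collector = ImgSrcCollector()  # fresh parser state per element
        collector.feed(desc.text or "")
        collector.close()
        srcs.extend(collector.srcs)
    return srcs


print(img_srcs_lenient(SAMPLE))  # ['https://example.com/hello.jpg']
```

The outer document still has to be well-formed XML; only the CDATA content is parsed leniently.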
Upvotes: 1
Reputation: 52888
If the content of your CDATA section is well-formed XML, you can pipe the output of xmllint into a second xmllint call so that the content is parsed as XML.
In your specific example, the content has several top-level p elements, so you have to wrap the output in another element (x in the example) to make it well-formed.
Example...
xmllint --xpath 'concat("<x>",string(//ROOT/ITEM/DESCRIPTION),"</x>")' input.xml | xmllint --xpath 'string(//img/@src)' -
Output...
https://example.com/hello.jpg
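For comparison, the same two-pass idea can be sketched in Python with xml.etree.ElementTree, assuming (as above) that the CDATA content is well-formed once wrapped. The function name img_srcs_two_pass is made up for illustration; the wrapper element x mirrors the xmllint command.

```python
import xml.etree.ElementTree as ET

# The test input from the question.
SAMPLE = """<ROOT>
<ITEM>
<DESCRIPTION><![CDATA[<p align="left" dir="ltr">
<span lang="EN">lorem ipsum</span></p>
<p>
some text</p>
<p>
<img alt="" src="https://example.com/hello.jpg" /></p>
]]></DESCRIPTION>
</ITEM>
</ROOT>"""


def img_srcs_two_pass(xml_text):
    """Parse the outer XML, then re-parse each DESCRIPTION text as XML."""
    root = ET.fromstring(xml_text)
    srcs = []
    for desc in root.iterfind("./ITEM/DESCRIPTION"):
        # Pass 1 yields the CDATA content as plain text. Wrap it in a
        # single root element so that pass 2 sees well-formed XML.
        fragment = "<x>" + (desc.text or "") + "</x>"
        inner = ET.fromstring(fragment)  # raises ParseError if malformed
        for img in inner.iter("img"):
            src = img.get("src")
            if src:
                srcs.append(src)
    return srcs


print(img_srcs_two_pass(SAMPLE))  # ['https://example.com/hello.jpg']
```

Unlike string(//img/@src), which returns only the first match, this collects every img in every ITEM.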
Upvotes: 1