Reputation: 77
I'd like to parse an HTML document with XMLStarlet. This worked well in the past, but since the underlying content generator changed, it keeps throwing errors.
I now receive more than two dozen error messages such as
-:157.22: Namespace prefix xlink for href on use is not defined
<use xlink:href="#menu"/>
because of newly embedded SVG images containing use xlink:href
tags. The corresponding namespace for the prefix xlink ought to be "http://www.w3.org/1999/xlink", which I added to the command segment in a first step
(...) | xml.exe sel -N n="http://www.w3.org/1999/xlink" -t -v "/html/body/div/div/div/main/ul/li[1]/h2/a/@href"
but apparently I didn't do it right, as the errors remain. I don't see any namespace declaration in the generated site content.
How do I fix the errors?
Update
The full command I'm working on:
wget -qO- "https://notepad-plus-plus.org/downloads/" | xml fo -H -Q | xml.exe sel -t -v "/html/body/div/div/div/main/ul/li[1]/h2/a/@href"
Occasionally I also get the following error message:
Attempt to load network entity http://www.w3.org/TR/REC-html40/loose.dtd
-:3.1: Start tag expected, '<' not found
I assume there's another namespace conflict.
Upvotes: 1
Views: 178
Reputation: 189
I just came across this question because I had the same issue, and I solved it as follows. Before piping the XML to xmlstarlet, run it through
sed -r -e 's_<(/?)\w+:_<\1_g' -e 's_\sxmlns:[^[:blank:]>]+__g'
This strips the namespace prefixes from element names and removes the xmlns:prefix declarations. Prefixed attributes such as xlink:href are left as they are.
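To see what the two expressions do, here is a small sketch on a made-up fragment. The sample markup and the function name strip_ns are assumptions for illustration, and -r, \w and \s need GNU sed:

```shell
# Hypothetical helper wrapping the sed call from above (GNU sed).
strip_ns() {
  sed -r -e 's_<(/?)\w+:_<\1_g' -e 's_\sxmlns:[^[:blank:]>]+__g'
}

# Made-up sample: prefixed elements, an xmlns declaration, a prefixed attribute.
sample='<svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="10"><svg:use xlink:href="#menu"/></svg:svg>'

out=$(printf '%s\n' "$sample" | strip_ns)
printf '%s\n' "$out"
```

On this sample the svg: element prefixes and the xmlns:svg declaration disappear, while the xlink:href attribute keeps its prefix.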
Upvotes: 0
Reputation: 29042
The error message from xmlstarlet
-:157.22: Namespace prefix xlink for href on use is not defined
<use xlink:href="#menu"/>
refers to the HTML file, not to your XPath expression. The file uses the xlink prefix without declaring it, so it is not namespace-well-formed XML. Browsers usually ignore that, but xmlstarlet complains about it.
One way to fix this is to add a namespace declaration to an ancestor of the <use xlink:href="#menu"/>
element. I chose the <body>
element for simplicity:
<body xmlns:xlink="http://www.w3.org/1999/xlink">
Then you don't even need the namespace declaration (-N) in the xmlstarlet command, because your XPath expression doesn't refer to any element or attribute in a namespace.
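For comparison, -N only becomes necessary once the XPath expression itself uses a prefix. A minimal sketch, assuming the binary is installed as xmlstarlet (it may be called xml or xml.exe, as in the question) and using a made-up fragment that already declares xlink:

```shell
# Made-up, well-formed sample that declares the xlink prefix itself.
doc='<svg xmlns:xlink="http://www.w3.org/1999/xlink"><use xlink:href="#menu"/></svg>'

if command -v xmlstarlet >/dev/null 2>&1; then
  # -N binds the prefix x for use inside the XPath expression only;
  # it does not add any declaration to the input document.
  href=$(printf '%s\n' "$doc" |
    xmlstarlet sel -N x="http://www.w3.org/1999/xlink" -t -v '//use/@x:href')
  printf '%s\n' "$href"
fi
```

The prefix chosen with -N (x here) does not have to match the one used in the document; only the namespace URI has to be the same.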
But how to correct the HTML is another thing.
A quick and dirty approach would be to use sed, which is usually considered bad practice for editing markup, but in this simple scenario it could be sufficient.
You could insert
sed -e 's/<body>/<body xmlns:xlink="http:\/\/www.w3.org\/1999\/xlink">/g'
in your command between wget and xml fo.
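As a rough sketch of what that substitution does, here it is applied to a made-up one-line document (the sample input and the variable names are assumptions). The second call is a variant, not part of the command above, that also matches a <body> tag carrying attributes:

```shell
# Same substitution as above; '|' as delimiter avoids escaping the slashes.
plain=$(printf '%s\n' '<body><use xlink:href="#menu"/></body>' |
  sed -e 's|<body>|<body xmlns:xlink="http://www.w3.org/1999/xlink">|g')

# Variant: keep the character after "<body" (a space or '>') so that
# a tag such as <body class="x"> is matched as well.
attrs=$(printf '%s\n' '<body class="x"><use xlink:href="#menu"/></body>' |
  sed -e 's|<body\([ >]\)|<body xmlns:xlink="http://www.w3.org/1999/xlink"\1|g')

printf '%s\n%s\n' "$plain" "$attrs"
```

The plain form only fires when the tag is exactly <body>, so the variant may be the safer choice if the generator ever adds a class or id to it.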
Upvotes: 0