dabbl0r

Reputation: 77

Namespace worries while html parsing with XMLStarlet

I'd like to parse an HTML document with XMLStarlet. This worked well in the past, but since the underlying content generator changed, it keeps throwing errors.

I now receive more than two dozen error messages such as

-:157.22: Namespace prefix xlink for href on use is not defined
  <use xlink:href="#menu"/>

because of newly embedded SVG images containing <use> elements with xlink:href attributes. The namespace for the prefix xlink ought to be "http://www.w3.org/1999/xlink", which I added to the command as a first step:

(...) | xml.exe sel -N n="http://www.w3.org/1999/xlink" -t -v "/html/body/div/div/div/main/ul/li[1]/h2/a/@href"

but apparently I didn't do it right, as the errors remain. I don't see any namespace declaration in the generated site content.

How do I fix the errors?

Update

The full command I'm working on:

wget -qO- "https://notepad-plus-plus.org/downloads/" | xml fo -H -Q | xml.exe sel -t -v "/html/body/div/div/div/main/ul/li[1]/h2/a/@href"

Occasionally I also get the following error message:

Attempt to load network entity http://www.w3.org/TR/REC-html40/loose.dtd
-:3.1: Start tag expected, '<' not found

I assume there's another namespace conflict.

Upvotes: 1

Views: 178

Answers (2)

Sebastian

Reputation: 189

I just came across this question because I had the same issue. I solved it like this: before piping the XML to xmlstarlet, run it through

sed -r -e 's_<(/?)\w+:_<\1_g' -e 's_\sxmlns:[^[:blank:]>]+__g'

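This strips the namespace prefixes from element names and removes the xmlns:prefix declarations. Note that neither expression matches a prefixed attribute such as xlink:href, so those are left as they are.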
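
To see what the two expressions do, here is a minimal sketch that wraps them in a helper and feeds them made-up sample input. The function name strip_ns and the sample markup are assumptions for illustration only; GNU sed is assumed because of -r, \w and \s.

```shell
# Hypothetical helper wrapping the two sed expressions from this answer.
# Assumes GNU sed (-r, \w and \s are GNU extensions).
strip_ns() {
  sed -r -e 's_<(/?)\w+:_<\1_g' -e 's_\sxmlns:[^[:blank:]>]+__g'
}

# Made-up sample: a prefixed element together with its declaration.
# Both the svg: prefixes and the xmlns:svg declaration are removed.
printf '%s\n' '<svg:g xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></svg:g>' | strip_ns

# A prefixed attribute is matched by neither expression and passes through unchanged.
printf '%s\n' '<use xlink:href="#menu"/>' | strip_ns
```

The second call suggests that this approach on its own may not silence the xlink:href error from the question, since the attribute prefix survives.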

Upvotes: 0

zx485

Reputation: 29042

The error message from xmlstarlet

-:157.22: Namespace prefix xlink for href on use is not defined
<use xlink:href="#menu"/>

refers to the HTML file and not to your XPath expression. The file uses the prefix xlink without declaring it, so it is not namespace-well-formed XML. Browsers usually ignore that, but xmlstarlet reports it as an error.

One way to fix this is to add a namespace declaration to an ancestor of the <use xlink:href="#menu"/> element. I chose the <body> element for simplicity:

<body xmlns:xlink="http://www.w3.org/1999/xlink">

Then you don't even need the namespace declaration (-N) in the xmlstarlet command, because your XPath expression doesn't refer to any node in a namespace.
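
The difference can be sketched on a small made-up document. This assumes the binary is installed as xmlstarlet (it is xml.exe in the question); the variable doc, the helper xlink_href and the prefix x are hypothetical names chosen for illustration. The prefix given to -N only has to match the one used in the XPath, not the one in the document.

```shell
# Made-up sample document in which the xlink prefix is declared on <body>.
doc='<html><body xmlns:xlink="http://www.w3.org/1999/xlink"><use xlink:href="#menu"/></body></html>'

# Hypothetical helper: -N binds the prefix x for use inside the XPath only.
xlink_href() {
  xmlstarlet sel -N x="http://www.w3.org/1999/xlink" -t -v '/html/body/use/@x:href'
}

# Run only if the binary is available under the assumed name.
if command -v xmlstarlet >/dev/null 2>&1; then
  # No -N needed: this XPath names no namespaced node.
  printf '%s\n' "$doc" | xmlstarlet sel -t -v 'count(/html/body/use)'
  # -N needed only because the XPath itself selects a namespaced attribute.
  printf '%s\n' "$doc" | xlink_href
fi
```

In other words, -N affects how the XPath is interpreted; it does not add a missing declaration to the input, which is why the errors in the question remained.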

How to correct the HTML is another matter.
A quick and dirty fix would be to use sed. Editing markup with sed is usually considered bad practice, but in this simple scenario it could be sufficient.

You could insert

sed -e 's/<body>/<body xmlns:xlink="http:\/\/www.w3.org\/1999\/xlink">/g'

in your command between wget and xml fo.
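
The sed expression above only matches a bare <body> tag. As a hedged variant, the sketch below also handles a <body> tag that carries attributes. The helper name add_xlink_ns and the sample markup are made up for illustration; the pattern uses only basic regular expressions, so it should not depend on GNU sed.

```shell
# Hypothetical helper: add the xlink declaration to the first <body ...> tag on a line.
# \([ >]\) captures the character after "<body" so that it can be put back.
add_xlink_ns() {
  sed -e 's|<body\([ >]\)|<body xmlns:xlink="http://www.w3.org/1999/xlink"\1|'
}

# Made-up sample input: a <body> tag that already has an attribute.
printf '%s\n' '<body class="home"><svg><use xlink:href="#menu"/></svg></body>' | add_xlink_ns
```

In the full command from the question, a helper like this would sit in the same place as the sed call, between wget and xml fo.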

Upvotes: 0
