Reputation: 21
I'm using HTML Tidy (version 5.8.0) to clean up web-sourced HTML for transforming with XSLT.
This generally worked fine until I found a page where two <use/> elements each carry the same xmlns:xlink declaration twice.
example.html:
<!DOCTYPE html>
<html>
<head>
<title>title</title>
</head>
<body>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
</svg>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
</svg>
</body>
</html>
This goes unremarked when run through HTML Tidy:
$ tidy --output-xhtml 1 example.html > /dev/null
Info: Document content looks like XHTML 1.0 Transitional
No warnings or errors were found.
EDIT: To clarify, I'm using HTML Tidy via library bindings within a Python script, and the example HTML above is heavily simplified to demonstrate the exact issue only. The command-line HTML Tidy is used here for ease of replication. I don't use xmllint in the script, but it is the simplest way to demonstrate the exact error shown by my XSLT processor.
But it fatally trips up both my XSLT processor and xmllint, for the same reason:
$ xmllint example.html
example.html:8: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
^
example.html:11: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
^
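The same rejection can be reproduced inside a Python script without xmllint. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree` (expat-based, so its error text differs from libxml2's); the helper name `is_well_formed` and the shortened sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        # A repeated attribute (including a repeated xmlns:* declaration)
        # is a well-formedness error, so a strict parser stops here.
        print(f"not well-formed: {err}")
        return False
    return True


# Shortened versions of the <use/> element from example.html.
DUPLICATED = (
    '<svg><use xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="#gel-icon-search"/></svg>'
)
CLEAN = (
    '<svg><use xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="#gel-icon-search"/></svg>'
)

print(is_well_formed(DUPLICATED))
print(is_well_formed(CLEAN))
```

A check like this could run on Tidy's output before it is handed to the XSLT processor, so the failure surfaces with a clear message.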
I would expect HTML Tidy to normalise duplicated xmlns:xlink... attributes as it does for normal attributes (it keeps either the first or the last).
Stripping out the duplicated xmlns:xlink... is simple with sed, but it is still double-handling.
Am I simply missing the correct Tidy option?
The following options make no difference:
--output-xml 1
--drop-proprietary-attributes 1
--doctype strict
--strict-tags-attributes 1
--repeated-attributes keep-first
I suspect this is a bug in Tidy and will submit it if there is no simple explanation.
EDIT: I understand that this is easily fixable with additional tools; the main point of this question is to establish whether HTML Tidy should be noticing and fixing this error, or whether that is beyond its scope.
Desired output from HTML Tidy:
<!DOCTYPE html>
<html>
<head>
<title>title</title>
</head>
<body>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
</svg>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
</svg>
</body>
</html>
Upvotes: 1
Views: 102
Reputation: 12662
UPDATE: a Python-only solution using lxml, which is based on libxml2, as is xmllint:
from lxml import etree

# recover=True lets libxml2 carry on past the duplicated xmlns:xlink declaration
parser = etree.XMLParser(recover=True)
tree = etree.parse("/home/lmc/tmp/test.html", parser=parser)
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))
Result:
<!DOCTYPE html>
<html>
<head>
<title>title</title>
</head>
<body>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
</svg>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
</svg>
</body>
</html>
Using xmllint --recover will still print the parser errors but strips the duplicated namespace declaration (a better option than using sed).
xmllint --recover ~/tmp/test.html
Result:
/home/lmc/tmp/test.html:8: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
^
/home/lmc/tmp/test.html:11: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
^
<?xml version="1.0"?>
<!DOCTYPE html>
<html>
<head>
<title>title</title>
</head>
<body>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
</svg>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
</svg>
</body>
</html>
The error messages go to stderr, so they can be dropped from the output:
xmllint --recover ~/tmp/test.html 2>/dev/null
To save the changes:
printf "%s\n" 'save' 'bye' | xmllint --recover --shell ~/tmp/test.html
xmlns:xlink=... is a namespace declaration, not a regular attribute, so --repeated-attributes keep-first won't work.
That namespace is what the xlink: prefix refers to in the attribute xlink:href="#gel-icon-no".
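If the clean-up has to happen inside the Python script, a text-level pass over the markup is one possibility, in the same spirit as the sed approach from the question. This is only a rough sketch with the standard `re` module; `dedupe_xmlns` and both patterns are hypothetical, and it assumes attribute values never contain `<` or `>` and never embed text that itself looks like an `xmlns` declaration.

```python
import re

# One start tag, e.g. <use a="1" b="2"/>. Assumes no "<" or ">" in attribute values.
TAG_RE = re.compile(r"<[A-Za-z][^<>]*>")

# A namespace declaration such as xmlns="..." or xmlns:xlink="...",
# together with the whitespace in front of it.
XMLNS_RE = re.compile(r"""\s+(xmlns(?::[\w.-]+)?)\s*=\s*("[^"]*"|'[^']*')""")


def dedupe_xmlns(markup: str) -> str:
    """Keep the first declaration of each xmlns name within every start tag."""

    def fix_tag(tag_match: re.Match) -> str:
        seen = set()

        def keep_first(decl: re.Match) -> str:
            name = decl.group(1)
            if name in seen:
                return ""  # drop the repeated declaration
            seen.add(name)
            return decl.group(0)

        return XMLNS_RE.sub(keep_first, tag_match.group(0))

    return TAG_RE.sub(fix_tag, markup)


sample = (
    '<use xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>'
)
print(dedupe_xmlns(sample))
```

A recovering parser is the more robust route; the regular expression only saves a second parse when the input is known to be this regular.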
Using the --shell option to get the qualified attributes:
printf "%s\n" 'setns xlink=http://www.w3.org/1999/xlink' 'cat //use/@xlink:href' 'bye' | xmllint --recover --shell ~/tmp/test.html 2>/dev/null
/ > setns xlink=http://www.w3.org/1999/xlink
/ > cat //use/@xlink:href
-------
xlink:href="#gel-icon-search"
-------
xlink:href="#gel-icon-no"
/ > bye
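The same lookup can be done from Python once the markup is well-formed. A small sketch with the standard library follows, assuming the input is already the recovered document (for example the output of xmllint --recover); `RECOVERED` and `xlink_hrefs` are illustrative names only.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Stand-in for the recovered document, with the duplicates already stripped.
RECOVERED = """<html>
<head><title>title</title></head>
<body>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
</svg>
<svg>
<use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
</svg>
</body>
</html>"""


def xlink_hrefs(markup: str) -> list:
    """Collect the namespace-qualified xlink:href value of every <use> element."""
    root = ET.fromstring(markup)
    # ElementTree stores qualified attribute names in {namespace-uri}local form.
    return [use.get(f"{{{XLINK_NS}}}href") for use in root.iter("use")]


print(xlink_hrefs(RECOVERED))
```

The lookup is keyed on the namespace URI rather than the prefix, so it keeps working if a page binds the XLink namespace to a different prefix.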
Upvotes: 0