mike

Reputation: 21

Redefined xmlns in HTML elements

I'm using HTML Tidy (version 5.8.0) to clean up web-sourced HTML before transforming it with XSLT. This generally worked fine until I came across a page where two <use/> elements each repeat an identical xmlns:xlink declaration.

example.html:

<!DOCTYPE html>
<html>
    <head>
        <title>title</title>
    </head>
    <body>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
        </svg>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
        </svg>
    </body>
</html>

This goes unremarked when run through HTML Tidy:

$ tidy --output-xhtml 1 example.html > /dev/null
Info: Document content looks like XHTML 1.0 Transitional
No warnings or errors were found.

But it is fatal for both my XSLT processor and xmllint, for the same reason.

EDIT To clarify: I'm using HTML Tidy via library bindings within a Python script, and the example HTML above is heavily simplified to demonstrate only this issue. The command-line HTML Tidy is used here for ease of replication. I don't use xmllint in the script, but it is the simplest way to demonstrate the exact error reported by my XSLT processor:

$ xmllint example.html
example.html:8: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
                                                                               ^
example.html:11: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
                                                                               ^
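
For anyone without xmllint to hand, the same well-formedness failure can be reproduced from Python's standard library. This is only a sketch: it uses `xml.etree.ElementTree` (expat) rather than libxml2, so the wording of the error message differs, and `BROKEN` and `is_well_formed` are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified fragment from the question: xmlns:xlink is declared twice.
BROKEN = (
    '<svg><use xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="#gel-icon-search"/></svg>'
)


def is_well_formed(markup: str) -> bool:
    """Return True if the stdlib XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        # A repeated attribute name is a well-formedness error in XML.
        print(f"rejected: {exc}")
        return False
    return True


result = is_well_formed(BROKEN)
```

Any conforming XML parser should reject the input in the same way, which is why both the XSLT processor and xmllint stop on it.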

I would expect HTML Tidy to normalise the duplicated xmlns:xlink declarations as it does duplicated ordinary attributes (it keeps either the first or the last).

Stripping out the duplicated xmlns:xlink declaration is simple with sed, but it is still double-handling of the document.
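
Since the real script is Python, the sed-style workaround could look roughly like the sketch below. It is a regex hack, not a parser: it assumes no `>` appears inside attribute values, keeps the first declaration of each prefix per start tag, and does not check whether a dropped declaration had a different URI. The function name `drop_repeated_xmlns` and both patterns are hypothetical, not part of Tidy or any library.

```python
import re

# One start tag; assumes attribute values never contain '<' or '>'.
TAG_RE = re.compile(r"<[A-Za-z][^<>]*>")
# A namespace declaration such as xmlns:xlink="..." or xmlns='...'.
NS_DECL_RE = re.compile(r"""\s+(xmlns(?::[\w.-]+)?)\s*=\s*("[^"]*"|'[^']*')""")


def drop_repeated_xmlns(markup: str) -> str:
    """Keep only the first declaration of each xmlns name in every start tag."""

    def fix_tag(tag_match):
        seen = set()

        def fix_decl(decl_match):
            name = decl_match.group(1)
            if name in seen:
                return ""  # repeated declaration: drop it
            seen.add(name)
            return decl_match.group(0)

        return NS_DECL_RE.sub(fix_decl, tag_match.group(0))

    return TAG_RE.sub(fix_tag, markup)


src = (
    '<use xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="#gel-icon-search" href="#gel-icon-search"/>'
)
cleaned = drop_repeated_xmlns(src)
print(cleaned)
```

This would run on the text before it is handed to the XML toolchain, so it is still an extra pass over the document.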

Am I simply missing the correct Tidy option?

The following options make no difference:

--output-xml 1
--drop-proprietary-attributes 1
--doctype strict
--strict-tags-attributes 1
--repeated-attributes keep-first

I suspect this is a bug in Tidy and will submit it if there is no simple explanation.

EDIT I understand that this is easily fixable with additional tools. The main point of this question is to establish whether HTML Tidy should be noticing and fixing this error, or whether that is beyond its scope.

Desired output from HTML Tidy:

<!DOCTYPE html>
<html>
    <head>
        <title>title</title>
    </head>
    <body>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
        </svg>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
        </svg>
    </body>
</html>

Upvotes: 1

Views: 102

Answers (1)

LMC

Reputation: 12662

UPDATE: A Python-only solution using lxml, which is based on libxml2, as is xmllint:

from lxml import etree

# recover=True lets libxml2 continue past the "Attribute ... redefined" error
parser = etree.XMLParser(recover=True)
tree = etree.parse("/home/lmc/tmp/test.html", parser=parser)
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))

Result:

<!DOCTYPE html>
<html>
    <head>
        <title>title</title>
    </head>
    <body>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
        </svg>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
        </svg>
    </body>
</html>

Running xmllint --recover still reports the error, but it drops the duplicate namespace declaration from the output (a better option than sed):

xmllint --recover ~/tmp/test.html

Result:

/home/lmc/tmp/test.html:8: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
                                                                               ^
/home/lmc/tmp/test.html:11: parser error : Attribute xmlns:xlink redefined
:xlink="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink"
                                                                               ^
<?xml version="1.0"?>
<!DOCTYPE html>
<html>
    <head>
        <title>title</title>
    </head>
    <body>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-search" href="#gel-icon-search" role="presentation"/>
        </svg>
        <svg>
            <use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#gel-icon-no" href="#gel-icon-no" role="presentation"/>
        </svg>
    </body>
</html>

The error messages go to stderr, so they can be kept out of the output:

xmllint --recover ~/tmp/test.html 2>/dev/null

To save the changes:

printf "%s\n" 'save' 'bye' | xmllint --recover --shell ~/tmp/test.html

xmlns:xlink=... is a namespace declaration, not a regular attribute, so --repeated-attributes keep-first does not apply to it. The declaration binds the xlink prefix, which is used on the same element by the attribute xlink:href="#gel-icon-no".

Using --shell option to get qualified attributes:

printf "%s\n" 'setns xlink=http://www.w3.org/1999/xlink' 'cat //use/@xlink:href' 'bye' | xmllint --recover --shell ~/tmp/test.html 2>/dev/null
/ > setns xlink=http://www.w3.org/1999/xlink
/ > cat //use/@xlink:href
 -------
 xlink:href="#gel-icon-search"
 -------
 xlink:href="#gel-icon-no"
/ > bye
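
A rough Python counterpart to the query above, using only the standard library on an already-cleaned fragment. It is meant to illustrate the namespace point rather than replace the lxml approach: ElementTree exposes xlink:href under its expanded `{namespace-uri}local` name and does not list the xmlns:xlink declaration among the element's attributes. `CLEAN` and `XLINK_NS` are names chosen for this sketch.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Cleaned-up fragment: one xmlns:xlink declaration per <use/> element.
CLEAN = (
    "<body>"
    '<svg><use xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="#gel-icon-search" href="#gel-icon-search"/></svg>'
    '<svg><use xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xlink:href="#gel-icon-no" href="#gel-icon-no"/></svg>'
    "</body>"
)

root = ET.fromstring(CLEAN)
uses = list(root.iter("use"))

# Qualified attribute lookup: "{namespace-uri}local-name".
qualified = [use.get(f"{{{XLINK_NS}}}href") for use in uses]
# The declaration itself is not exposed as an ordinary attribute.
declared_as_attr = ["xmlns:xlink" in use.attrib for use in uses]

print(qualified)
print(declared_as_attr)
```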

Upvotes: 0
