user1424739

Reputation: 13645

`-:55: HTML parser error : htmlParseEntityRef: expecting ';'`: clean up HTML file with xmllint?

http://journals.im.ac.cn/cjbcn/ch/reader/view_abstract.aspx?file_no=gc19010159&flag=1

I'd like to clean up the HTML file from the above URL, but xmllint reports the following errors. Does anybody know how to fix the problem? Thanks.

$ xmllint -html -xmlout file.html
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225413001&menu_id
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
on_item.aspx?parent_id=20070610225413001&menu_id=20070610225740001&is_three_menu
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225449001&menu_id
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
on_item.aspx?parent_id=20070610225449001&menu_id=20171222045531778&is_three_menu
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
ges/dh-img.jpg"><A href="../common_item.aspx?parent_id=20070610225428001&menu_id
                                                                               ^
-:55: HTML parser error : htmlParseEntityRef: expecting ';'
...

Upvotes: 0

Views: 720

Answers (2)

Tom O'Hara

Reputation: 71

For future reference: encoding each bare '&' in the URLs as the entity '&amp;' resolves the htmlParseEntityRef errors in this particular HTML file from the Chinese journal.

A simple example illustrating a workaround via Perl follows:

$ cat bad-simple.html 
<!DOCTYPE HTML>
<html lang="en">
  <head>
    <title>bad URL links </title>
  </head>
  <body>
    <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
  </body>
</html>
$ 
$ xmllint --html --noout bad-simple.html
bad-simple.html:7: HTML parser error : htmlParseEntityRef: expecting ';'
    <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
                                            ^
bad-simple.html:7: HTML parser error : htmlParseEntityRef: expecting ';'
    <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
                                                  ^
$ perl -pe 's/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g' bad-simple.html >| better-simple.html
$ xmllint --html --noout better-simple.html
$ 
$ diff bad-simple.html better-simple.html 
7c7
<     <a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>
---
>     <a href="http://www.fubar.com?fubar=1&amp;fu=0&amp;bar=0">fubar</a>
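
The same idea can be written in Python, as a minimal sketch rather than a definitive fix. It escapes only ampersands that do not already begin a complete entity reference (named, decimal, or hex), so text such as `&lt;` or `&#39;` is left alone. The names `BARE_AMP` and `escape_bare_ampersands` are made up for this illustration. A URL parameter that happens to look like a complete entity reference (for example `&lt;=2`) would not be escaped.

```python
import re

# "&" that does NOT start a complete entity reference:
# named (&amp;), decimal (&#38;), or hexadecimal (&#x26;).
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(html: str) -> str:
    """Replace every bare '&' with '&amp;', leaving existing entities as they are."""
    return BARE_AMP.sub("&amp;", html)


if __name__ == "__main__":
    line = '<a href="http://www.fubar.com?fubar=1&fu=0&bar=0">fubar</a>'
    print(escape_bare_ampersands(line))
    # prints: <a href="http://www.fubar.com?fubar=1&amp;fu=0&amp;bar=0">fubar</a>
```

Because already-escaped references are skipped, running the function twice gives the same result as running it once.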

Upvotes: 0

imhotap

Reputation: 2490

The problem appears to be the ampersand character in URLs with query parameters: xmllint tries to interpret it as the start of an entity reference and then complains, because entity references in XML must be terminated by a semicolon (unlike in SGML, where a semicolon is required only if the subsequent characters are name characters). You could try xmllint's "-noent" option, but I don't believe xmllint can be told to ignore entity references, so I suggest using another tool to convert the HTML into XML, such as "sgmlproc", as described in my Parsing HTML tutorial. The tutorial discusses ampersand handling in detail; the approach uses an HTML DTD in which href and other URL-typed attributes are declared such that no entity references are recognized in them.
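
To illustrate the idea of treating only URL-typed attributes specially, here is a rough Python approximation. It is not sgmlproc and does not use a DTD; it is a regex-based sketch that escapes bare ampersands inside quoted `href`, `src`, and `action` values only and leaves the rest of the markup untouched. The attribute list and the names `URL_ATTR` and `escape_url_attributes` are assumptions made for this example, and unquoted attribute values are not handled.

```python
import re

# "&" that does not start a complete entity reference.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

# Quoted values of a few URL-typed attributes (an illustrative, incomplete list).
URL_ATTR = re.compile(
    r"""\b(href|src|action)(\s*=\s*)(["'])(.*?)\3""",
    re.IGNORECASE | re.DOTALL,
)


def escape_url_attributes(html: str) -> str:
    """Escape bare '&' only inside quoted href/src/action attribute values."""
    def fix(match: re.Match) -> str:
        name, equals, quote, value = match.groups()
        return f"{name}{equals}{quote}{BARE_AMP.sub('&amp;', value)}{quote}"

    return URL_ATTR.sub(fix, html)


if __name__ == "__main__":
    sample = '<A href="../common_item.aspx?parent_id=1&menu_id=2">menu</A>'
    print(escape_url_attributes(sample))
    # prints: <A href="../common_item.aspx?parent_id=1&amp;menu_id=2">menu</A>
```

A regex cannot parse HTML in general, so this is best suited to pre-cleaning a file before handing it to xmllint or a comparable parser.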

Sorry for the long answer and self-promotion, but I know of no better solution for your problem. I originally intended this to be a comment but ran out of space.

Upvotes: 0
