mardu

Reputation:

Problem with libxml character encoding on win32

While parsing some HTML files with libxml, the function xmlParseFile() reports that the input contains non-UTF-8 characters. How can I change the library's default charset to ISO-8859-1? Is there any other way to solve this?

PS: The entire development is based on libxml and works in most cases so I can't switch to another library.

Upvotes: 0

Views: 237

Answers (1)

Remy Lebeau

Reputation: 595295

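The encoding used for XML data must be declared in the XML's prolog, for example <?xml version="1.0" encoding="ISO-8859-1"?>. If no encoding is declared (and there is no byte order mark or external encoding information), the W3C XML spec dictates that UTF-8 must be assumed.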

Why are you using an XML parser for parsing HTML data? libxml has an HTML parser separate from its XML parser. Look at htmlParseFile() and related functions. Since HTML is not XML, there is no XML prolog present to indicate the data encoding. HTML does have a <meta> tag that can be used inside the <head> tag for that, though. libxml's HTML parser looks for that tag to determine the encoding when one is not explicitly passed to htmlParseFile() through its encoding parameter.

Upvotes: 1
