jstck

Reputation: 501

Parsing UTF-8-encoded XML in MSXML/ASP

I'm at the receiving end of an HTTP POST (application/x-www-form-urlencoded), where one of the fields contains an XML document. I need to receive that document, look at a couple of elements, and store it in a database (for later use). The document is UTF-8 encoded (and has the appropriate XML declaration), and can contain lots of non-ASCII characters.

When I receive the data, like this:

Set xmlDoc = CreateObject("MSXML2.DOMDocument.3.0")
xmlDoc.async = False
xmlDoc.loadXML(Request.Form("xml"))

everything I can dig out of the DOM document is still in UTF-8 form. For example, this document (grossly simplified):

<?xml version="1.0" encoding="UTF-8"?>
<data>
 ä
</data>

always comes out as

<?xml version="1.0" encoding="UTF-8"?>
<data>
 Ã¤
</data>

If I look at xmlDoc.XML, I get this:

<?xml version="1.0"?>
<data>
 Ã¤
</data>

It removes the encoding from the header (since whatever string I'm using in VBScript is "encoding-agnostic", this sort of makes sense), but it's still a sequence of characters representing the bytes of a UTF-8 encoded document.

It's just as if MSXML didn't care about the encoding info in the header. Is the problem with MSXML, or is it with the encoding of the post data? It's a form of "double encoding", first UTF-8 (where certain characters are written with several bytes) and then urlencoded byte by byte ("ä" is actually sent as %C3%A4).
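The two layers can be reproduced outside ASP. Here is a minimal sketch in Python (used only for illustration, since it runs standalone); the sample value is the "ä" from the document above, and the variable names are made up for this example:

```python
from urllib.parse import quote

text = "ä"

# Layer 1: UTF-8 turns the single character into two bytes, C3 A4.
utf8_bytes = text.encode("utf-8")

# Layer 2: form encoding percent-encodes each of those bytes separately.
form_value = quote(text, encoding="utf-8")

print(form_value)  # %C3%A4
```

So the receiver has to undo both layers, and it has to know which character set the bytes were in to undo the first one correctly.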

I would rather not hard-code anything, such as assuming that it is always UTF-8 (it could well be UTF-16 sometime in the future). I cannot do a "hard conversion" to any other character set either (such as ISO-8859-1), as the data can contain Cyrillic and Arabic characters. How should I go about fixing this?

Upvotes: 1

Views: 3376

Answers (2)

AnthonyWJones

Reputation: 189439

Option 1

Before reading any form fields, modify your Response.CodePage value:-

Response.CodePage = 65001

The problem is that the receiving page does not know the form data is UTF-8 encoded. Hence the %C3%A4 data is seen as two distinct ANSI characters. The page's Response.CodePage, weirdly, influences how the form data is decoded in the absence of character set info sent by the client.
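To see what "two distinct ANSI characters" means in practice, here is a small Python sketch (Python only because it runs standalone; this is not ASP code). It assumes the server's default ANSI code page is Windows-1252, which may differ on your machine, and decode_form_value is a hypothetical helper name:

```python
from urllib.parse import unquote

def decode_form_value(raw: str, code_page: str) -> str:
    """Percent-decode a form value, reading the resulting bytes in the given code page."""
    return unquote(raw, encoding=code_page)

raw = "%C3%A4"

# Assumed default: Windows-1252. Bytes C3 and A4 become two separate characters.
as_ansi = decode_form_value(raw, "cp1252")   # "Ã¤"

# Code page 65001 is UTF-8. Bytes C3 A4 become the single character intended.
as_utf8 = decode_form_value(raw, "utf-8")    # "ä"
```

Setting Response.CodePage = 65001 should move the form decoding from the first case to the second.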

Option 2

Modify the form element on the source page. Add the following attribute to it:-

<form accept-charset="UTF-8" ...>

This enforces UTF-8 encoding of the characters in the post, and causes the post to carry data about the chosen charset, which gives the server the info it needs to decode the data correctly.
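For a client that is not a browser, the closest equivalent is presumably a charset parameter on the Content-Type header. The Python sketch below only builds such a request, it does not send it. The URL and the build_post helper are hypothetical, and whether classic ASP actually honours the charset parameter is an assumption you would need to test:

```python
from urllib.parse import quote
from urllib.request import Request

def build_post(url: str, xml_text: str) -> Request:
    """Build (but do not send) a form POST that declares its charset explicitly."""
    # Percent-encode the UTF-8 bytes of the XML, including '/' and other reserved characters.
    body = "xml=" + quote(xml_text, safe="", encoding="utf-8")
    return Request(
        url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"},
        method="POST",
    )

# Hypothetical endpoint, for illustration only.
req = build_post("http://example.invalid/receive.asp", "<data>ä</data>")
print(req.data)  # b'xml=%3Cdata%3E%C3%A4%3C%2Fdata%3E'
```

With curl, a header like this can be supplied with the -H option.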

Option 3

Finally, my personal preference: don't post XML as field values in a form. Instead, turn it around by adding the other form field values as attributes or elements to the XML, then post the XML using XMLHttpRequest. For navigation, have the server return a URL to which the client should navigate; that URL would contain a GUID handle to the posted data, so that when the server receives the request it can take the appropriate action. I realize, however, that this is all quite a bit more work, in which case one of the other two options should work for you.

Upvotes: 3

ionn

Reputation: 1583

Option 3 can be pretty much ruled out at the moment due to the added complexity of such a rewrite.

Option 1 just seems strange to me: why should the codepage of the response dictate what happens with the request? But if that's the way it is, then so be it.

As for option 2, it's not really a browser form posting, but a small script client (using CURL). What HTTP header would a browser send as a result of that attribute, so that I could add it to the scripted request?

In all, I guess this means that MSXML simply ignores whatever encoding is set in the xml header when loading from a string.
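The same effect can be shown with a different parser. This Python sketch uses xml.etree.ElementTree, not MSXML, so it is only an analogy, and it assumes the bytes were wrongly decoded as Windows-1252 before parsing:

```python
import xml.etree.ElementTree as ET

raw = b'<?xml version="1.0" encoding="UTF-8"?><data>\xc3\xa4</data>'

# Parsing the raw bytes: the parser reads the declaration and decodes C3 A4 itself.
from_bytes = ET.fromstring(raw).text                # "ä"

# Parsing a string that was already decoded with the wrong code page:
# the characters are fixed at this point, so the declaration cannot repair them.
misdecoded = raw.decode("cp1252")
from_string = ET.fromstring(misdecoded).text        # "Ã¤"
```

In other words, once the data has become a string, the damage (if any) has already been done before the parser sees it.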

Upvotes: 0
