POSTing XML with Chinese characters to the Microsoft Translator API raises deserializing exception

Question

I'm trying to translate from Chinese (Simplified) to English using the Microsoft Translator API.

A couple of requirements

I must use the HTTP method POST rather than GET with a query string, because my queries exceed Microsoft's URI limit of 15,845 characters. This can happen even when the text is under the 10,000-character limit, as it is with Chinese characters: the query string has to be URL encoded, which dramatically increases its length, but Microsoft decodes it before the character count is determined.
The only translation method that accepts POST is TranslateArrayMethod; TranslateMethod, for example, only allows GET. Unfortunately, TranslateArrayMethod only accepts an XML document, so I must work with XML.
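
To illustrate how much URL encoding inflates Chinese text, here is a minimal Node.js sketch. The helper name `encodedLength` and the sample string are made up for this illustration; `encodeURIComponent` is the standard JavaScript function, and it percent-encodes each UTF-8 byte as three characters.

```javascript
// Hypothetical helper: length of a string once it is URL encoded.
function encodedLength(text) {
  return encodeURIComponent(text).length;
}

const sample = '你好'; // two Chinese characters, three UTF-8 bytes each

// Each UTF-8 byte becomes "%XX", so every character here grows to 9 characters.
console.log(sample.length, encodedLength(sample)); // prints: 2 18
```

A text well under the 10,000-character limit can therefore still exceed the URI limit once it is encoded.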

The following is an example of an XML document that I am sending:

This works fine, the result is:



    es
    
    4

Hello

5

However, if I then add any Chinese character, like so:


    
    zh-CHS
    
        text/plain
    
    
        
        
        
    
    en

I get a weird response:


    
    System.Runtime.Serialization.SerializationException
    Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 298.

Note that I also tried sending the text without CDATA escaping, but that didn't help. Changing the From language had no effect either.

I'm working with Node.js (JavaScript), although since this is a generic HTTP API I don't think that should matter.

Teg Grenager · Accepted Answer

OK, I encountered exactly the same problem calling one of the Microsoft Translator POST APIs from Node.js. The API works fine - it returns the translation as expected - as long as there are no non-ASCII characters, but when I add a single accented 'é' character to the appropriate section of the POST body, it responds with an error:

    System.Runtime.Serialization.SerializationException
    Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 782.

I figured out that the problem is that the Content-Length header must give the length of the body in bytes, but I had been sending the length in characters. Why does this happen? Well, the typical way to measure the length of the body for the Node http request is to call

var length = body.length

and get the "length" - i.e. the number of characters (strictly, UTF-16 code units) - of the string. This works when all of the characters are ASCII. However, in UTF-8, non-ASCII characters (including my accented 'é') take more than one byte each. So when the body contains non-ASCII characters, the byte length is greater than the string length, and a Content-Length based on the string length is too small. In this case, it caused the Microsoft server to stop reading the body before its end, so it saw truncated XML and returned the "Unexpected end of file" error.
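
The mismatch is easy to see in a small sketch. The helper name `lengths` and the sample strings are made up for this illustration; `Buffer.byteLength` is the standard Node.js call.

```javascript
// Hypothetical helper: report both measures of a string's size.
function lengths(text) {
  return {
    units: text.length,                      // UTF-16 code units ("characters")
    bytes: Buffer.byteLength(text, 'utf8'),  // bytes actually sent on the wire
  };
}

// '\u00e9' is the precomposed 'é'; it takes two bytes in UTF-8.
for (const text of ['Hello', 'caf\u00e9', '你好']) {
  const { units, bytes } = lengths(text);
  console.log(`${text}: length=${units}, bytes=${bytes}`);
}
// Hello: length=5, bytes=5
// café: length=4, bytes=5
// 你好: length=2, bytes=6
```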

Instead we need to measure the length in bytes with the call (in Node.js)

var length = Buffer.byteLength(body, 'utf8')

and send that length in the Content-Length header. With that change, the Microsoft Translator API works again.
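
Putting it together, a request could be built roughly like this. This is only a sketch: `buildPostOptions` and `postXml` are names invented for the example, the host and path are left as parameters rather than filled in, the `text/xml` content type is an assumption, and the authentication header the API requires is omitted.

```javascript
const https = require('https');

// Hypothetical helper: request options whose Content-Length is the
// UTF-8 byte count of the body, not its string length.
function buildPostOptions(host, path, body) {
  return {
    host: host,
    path: path,
    method: 'POST',
    headers: {
      'Content-Type': 'text/xml; charset=utf-8', // assumed content type
      'Content-Length': Buffer.byteLength(body, 'utf8'),
    },
  };
}

// Hypothetical helper: POST the XML body and collect the response text.
function postXml(host, path, body, callback) {
  const req = https.request(buildPostOptions(host, path, body), (res) => {
    let data = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { data += chunk; });
    res.on('end', () => callback(null, res.statusCode, data));
  });
  req.on('error', callback);
  req.end(body, 'utf8'); // write the body as UTF-8 and finish the request
}
```

Another option is to omit Content-Length entirely, in which case Node.js falls back to chunked transfer encoding; whether the server accepts that would need to be checked.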
