Reputation: 8127
I'm trying to translate from Chinese (Simplified) to English using the Microsoft Translator API.
A couple of requirements:
I must use the HTTP method POST, and not GET with a query string, because my queries exceed Microsoft's URI limit of 15,845 characters. Note that this is possible even when I use fewer than the 10,000-character limit in the case of Chinese characters. The reason is that the query string has to be URL encoded, which dramatically increases the length, but it is decoded by Microsoft before the character count is determined.
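To make the URL-encoding blow-up concrete, here is a minimal sketch using the standard encodeURIComponent function. The sample text (the character 南 repeated 1,000 times) is only an illustration, not one of my real queries:

```javascript
// 1,000 Chinese characters, well under the 10,000-character limit.
const text = '\u5357'.repeat(1000); // U+5357 is 南

// Each character is 3 bytes in UTF-8, and each byte becomes "%XX" (3 characters).
const encoded = encodeURIComponent(text);

console.log(text.length);    // 1000
console.log(encoded.length); // 9000
```

So the encoded query string is nine times longer than the text it carries, which is how a query under the character limit can still exceed the URI limit.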
The only translate HTTP method that allows POSTs is the TranslateArrayMethod; the TranslateMethod, for example, only allows GETs. Unfortunately, the TranslateArrayMethod only accepts an XML document, so I must work with XML.
The following is an example of an XML document that I am sending:
<TranslateArrayRequest>
  <AppId/>
  <From>es</From>
  <Options>
    <ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
  </Options>
  <Texts>
    <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
      <![CDATA[Hola]]>
    </string>
  </Texts>
  <To>en</To>
</TranslateArrayRequest>
This works fine, the result is:
<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
  <TranslateArrayResponse>
    <From>es</From>
    <OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
      <a:int>4</a:int>
    </OriginalTextSentenceLengths>
    <TranslatedText>Hello</TranslatedText>
    <TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
      <a:int>5</a:int>
    </TranslatedTextSentenceLengths>
  </TranslateArrayResponse>
</ArrayOfTranslateArrayResponse>
However, if I then add any Chinese character, like so:
<TranslateArrayRequest>
  <AppId/>
  <From>zh-CHS</From>
  <Options>
    <ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
  </Options>
  <Texts>
    <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
      <![CDATA[南]]>
    </string>
  </Texts>
  <To>en</To>
</TranslateArrayRequest>
I get a weird response:
<html>
<body/>
<h1>System.Runtime.Serialization.SerializationException</h1>
<p>Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 298.</p>
</html>
Note that I also tried not using CDATA escaping, but it doesn't help. Changing the From language has no effect either.
I'm working with Node.js (JavaScript), although since this is a generic HTTP API I don't think that should matter.
Upvotes: 1
Views: 551
Reputation: 305
OK, I encountered exactly the same problem calling one of the Microsoft Translator POST APIs from Node.js. The API works fine - returns the translation as expected - as long as there are no non-ASCII characters, but when I add a single accented 'é' character to the appropriate <string> section of the POST body, it responds with an error:
<html><body/><h1>System.Runtime.Serialization.SerializationException</h1>
<p>Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 782.</p>
</html>
I figured out that the problem is that the Content-Length header wants the length in bytes, but I had been sending the length in characters. Why does this happen? Well, the typical way to measure the length of the body for the Node http request is to call
var length = body.length
and get the "length" of the string, i.e. its number of characters (strictly, UTF-16 code units). This works when all of the characters are ASCII. However, in UTF-8, non-ASCII characters (including my accented 'é') take more than one byte each. So when the body contains non-ASCII characters, the byte length is larger than the character length, and the character length is the wrong value to send. In this case, it caused the Microsoft server to stop reading the message prematurely, generating the error message.
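The mismatch, and why it surfaces as an "Unexpected end of file" error, can be sketched as follows. The XML fragment is a shortened stand-in for the real request body, and the slicing only imitates a server that reads exactly Content-Length bytes; it is not how the Microsoft server is actually implemented:

```javascript
// Shortened stand-in for the request body, with one Chinese character (U+5357, 南).
const body = '<Texts><string>\u5357</string></Texts>';

const charLength = body.length;                     // what I was sending
const byteLength = Buffer.byteLength(body, 'utf8'); // what Content-Length needs

// Imitate a server that trusts Content-Length and reads only that many bytes.
const whatServerReads = Buffer.from(body, 'utf8')
  .subarray(0, charLength)
  .toString('utf8');

console.log(charLength, byteLength); // the byte length is 2 larger
console.log(whatServerReads);        // the closing tag is cut off
```

The tail of the document is lost, so the XML parser sees an unclosed element, which matches the error message above.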
Instead we need to measure the length in bytes with the call (in Node.js)
var length = Buffer.byteLength(body, 'utf8')
and send that length in the Content-Length header, and the Microsoft Translator API works again.
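Put together, a request along these lines is roughly what I ended up with. This is only a sketch: buildHeaders and postXml are hypothetical helper names, the text/xml content type is an assumption, the host and path are left to the caller, and any authentication header the API requires is omitted.

```javascript
const http = require('http');

// Build the headers for an XML POST body. Content-Length is measured in
// bytes, not in string characters. The content type is an assumption.
function buildHeaders(body) {
  return {
    'Content-Type': 'text/xml; charset=utf-8',
    'Content-Length': Buffer.byteLength(body, 'utf8'),
  };
}

// Hypothetical helper: POST the XML body and pass the response text to callback.
// host and path are supplied by the caller; authentication is not shown.
function postXml(host, path, body, callback) {
  const req = http.request(
    { host, path, method: 'POST', headers: buildHeaders(body) },
    (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => callback(null, Buffer.concat(chunks).toString('utf8')));
    }
  );
  req.on('error', callback);
  req.end(body, 'utf8'); // write the body as UTF-8 and finish the request
}
```

Another option is to convert the body to a Buffer once and use its length property, which is also a byte count.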
Upvotes: 1
Reputation: 301
Most probably, the problem is not the Chinese language, but that MS Translator doesn't like new line symbols. When I stumbled into this error message, I changed the following:
In the content of every <string> node, I replaced the XML reserved characters with their entity representations:
& → &amp;
< → &lt;
> → &gt;
' → &apos;
" → &quot;
After that, all worked smoothly. Concerning your particular example, the symbol "南" was translated as "South". I didn't use CDATA escaping.
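The replacements above can be sketched as a small helper. escapeXml is a hypothetical name of my own, not part of any library; it handles all five characters in a single pass so that the & inside an already produced entity is not escaped a second time:

```javascript
// Map each XML reserved character to its predefined entity.
const XML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  "'": '&apos;',
  '"': '&quot;',
};

// Hypothetical helper: escape text before placing it inside a <string> node.
function escapeXml(text) {
  return text.replace(/[&<>'"]/g, (ch) => XML_ENTITIES[ch]);
}

console.log(escapeXml('Tom & "Jerry" <3')); // Tom &amp; &quot;Jerry&quot; &lt;3
```

Non-ASCII text such as "南" passes through unchanged, since only the five reserved characters are touched.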
Upvotes: 1