RuSh

Reputation: 1693

ASP.NET - Unable to translate Unicode character XXX at index YYY to specified code page

On an ASP.NET 4 website, I'm getting the following error when trying to load data from the database into a GridView:

Unable to translate Unicode character \uD83D at index 49 to specified code page.

I've found out that this happens when a data row contains: Text Text Text 😊😊

As I understand it, this text cannot be translated into a valid UTF-8 response.

  1. Is that really the reason?

  2. Is there a way to clean the text before loading it into the gridview to prevent such errors?


UPDATE:

I have made some progress: I only get this error when I call the Substring method on the string. (I'm using Substring to show part of the text as a preview to the user.)

For example in an ASP.NET Web Form I do this:

string txt = "test 💔";

// The same string can also be built from code points:
// string txt = char.ConvertFromUtf32(116) + char.ConvertFromUtf32(101) + char.ConvertFromUtf32(115) + char.ConvertFromUtf32(116) + char.ConvertFromUtf32(32) + char.ConvertFromUtf32(128148);

// this works ok txt is shown in the webform label.
Label1.Text = txt; 

//length is equal to 7.
Label2.Text = txt.Length.ToString();

//causes exception - Unable to translate Unicode character \uD83D at index 5 to specified code page.
Label3.Text = txt.Substring(0, 6);

I know that .NET strings are UTF-16, which represents characters outside the Basic Multilingual Plane as surrogate pairs.

When I use the Substring method, I accidentally cut a surrogate pair in half, and the leftover high surrogate causes the exception. I found out that I can use the StringInfo class instead:

var si = new System.Globalization.StringInfo(txt);
var l = si.LengthInTextElements; // length is equal to 6.
Label3.Text = si.SubstringByTextElements(0, 5); //no exception!
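For a preview of arbitrary length, the StringInfo approach could be wrapped in a small helper. This is only a sketch: `PreviewText.Truncate` is a hypothetical name, and it assumes the preview length is counted in text elements rather than in chars.

```csharp
using System;
using System.Globalization;

public static class PreviewText
{
    // Hypothetical helper: returns at most maxElements text elements of text,
    // never splitting a surrogate pair.
    public static string Truncate(string text, int maxElements)
    {
        if (string.IsNullOrEmpty(text) || maxElements <= 0)
            return string.Empty;

        var info = new StringInfo(text);
        // SubstringByTextElements throws if the requested length runs past
        // the end of the string, so clamp the count first.
        int count = Math.Min(maxElements, info.LengthInTextElements);
        return info.SubstringByTextElements(0, count);
    }
}
```

With the `txt` string from above, `PreviewText.Truncate(txt, 5)` would keep only the five plain characters, while `PreviewText.Truncate(txt, 6)` would keep the emoji whole.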

Another alternative is to just delete the surrogate chars:

Label3.Text = ValidateUtf8(txt).Substring(0, 3); //no exception!

    // Requires: using System.Text;
    public static string ValidateUtf8(string txt)
    {
        StringBuilder sbOutput = new StringBuilder();

        for (int i = 0; i < txt.Length; i++)
        {
            char ch = txt[i];
            // Keep tab, LF, CR and BMP characters outside the surrogate
            // range (0xD800-0xDFFF); everything else is dropped.
            if ((ch >= 0x0020 && ch <= 0xD7FF) ||
                (ch >= 0xE000 && ch <= 0xFFFD) ||
                ch == 0x0009 ||
                ch == 0x000A ||
                ch == 0x000D)
            {
                sbOutput.Append(ch);
            }
        }
        return sbOutput.ToString();
    }
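A less destructive variant might keep the char-based length of Substring but back off by one char when the cut would fall between the two halves of a pair. The following is a minimal sketch of that idea; `SafeText.SafeSubstring` is a hypothetical name, not a framework method.

```csharp
using System;

public static class SafeText
{
    // Hypothetical helper: like text.Substring(0, maxChars), but never
    // returns a string that ends in an unpaired high surrogate.
    public static string SafeSubstring(string text, int maxChars)
    {
        if (string.IsNullOrEmpty(text) || maxChars <= 0)
            return string.Empty;
        if (text.Length <= maxChars)
            return text;

        int end = maxChars;
        // If the last kept char is a high surrogate, its low surrogate
        // would be cut off, so drop the high surrogate as well.
        if (char.IsHighSurrogate(text[end - 1]))
            end--;
        return text.Substring(0, end);
    }
}
```

Unlike deleting every surrogate, this keeps emoji that fit entirely inside the preview and only drops the one that would be split.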

Is this really a problem of surrogate pairs?

Which characters use surrogate pairs? Is there a list?

Should I keep support for surrogate pairs? Should I use the StringInfo class, or just delete the invalid chars?

Thanks!

Upvotes: 24

Views: 30408

Answers (3)

Dave Bish

Reputation: 19646

I have just found out that Application Request Routing if installed in IIS 7.5 will force %2f to be handled differently, thus causing issues.

Removing ARR solved this issue for us.

Upvotes: 0

LaserJesus

Reputation: 8540

You could try round-tripping the text through UTF-8 first (in the row bound event or something similar). The following code encodes the text as UTF-8 and removes un-encodable characters, such as unpaired surrogates.

private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
    "UTF-8",
    new EncoderReplacementFallback(string.Empty),
    new DecoderExceptionFallback()
);

var utf8Text = Utf8Encoder.GetString(Utf8Encoder.GetBytes(text));
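To show where this would fit with a truncated preview, here is a sketch that wraps the same encoder in a helper; `Utf8Cleaner.Clean` is a hypothetical name. It assumes the cleanup runs after Substring, since a complete surrogate pair is encodable and passes through unchanged, while a lone half left behind by a cut is dropped.

```csharp
using System;
using System.Text;

public static class Utf8Cleaner
{
    // Same configuration as above: un-encodable input is replaced with
    // the empty string instead of throwing.
    private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
        "UTF-8",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    // Hypothetical helper: round-trips text through UTF-8 bytes.
    public static string Clean(string text)
    {
        return Utf8Encoder.GetString(Utf8Encoder.GetBytes(text));
    }
}
```

For example, `Utf8Cleaner.Clean(txt.Substring(0, 6))` would strip the dangling high surrogate from the question's string before it is assigned to a label.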

Upvotes: 29

devio

Reputation: 37205

Character U+1F60A is an emoji character introduced in Unicode 6.0. Its UTF-16 representation is 0xD83D 0xDE0A, a surrogate pair. (You did not mention the database you are using; SQL Server uses the similar UCS-2.)

Since Unicode 6.0 was released in Oct 2010, my guess is that either SQL Server, or (ASP).Net 4, or the conversion between SQL Server data and .Net data does not support the emoji code points.
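The surrogate values can be checked by hand: every code point from U+10000 to U+10FFFF is stored in UTF-16 as one high surrogate (0xD800-0xDBFF) followed by one low surrogate (0xDC00-0xDFFF). A small sketch of that arithmetic follows; `SurrogateMath.Split` is a hypothetical name used only for illustration, and the framework's own `char.ConvertFromUtf32` performs the same conversion.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    public static void Split(int codePoint, out char high, out char low)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;          // 20-bit value
        high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
    }
}
```

For U+1F60A the offset is 0xF60A, whose top ten bits are 0x3D and bottom ten bits are 0x20A, which gives 0xD83D and 0xDE0A.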

Upvotes: 0
