curious dog

Reputation: 141

Decoding UTF Issue?

I am working on my Android project and I have a strange problem that is driving me crazy. I am trying to convert a String to UTF-16 or UTF-8. I use the code below to do it, but it gives me an array with some negative members!

Java code:

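    String Tag = "سیر";
    String Value = "";
    try {
        byte[] bytes = Tag.getBytes("UTF-16");
        for (int i = 0; i < bytes.length; i++) {
            // Each element is a signed Java byte, printed as a decimal number
            Value = Value + String.valueOf(bytes[i]) + ",";
        }
    } catch (UnsupportedEncodingException ex) {
        Log.d("Er>encoding-Problem", ex.toString());
    }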

Array members: [-1,-2,51,6,-52,6,49,6]. I checked the UTF-16 table and it doesn't contain any negative numbers. I also used a website that converts words to UTF-16; it gave me "0633 06CC 0631" in hex. If you convert these numbers to decimal you get "1587 1740 1585". As you can see, there are no negative numbers here! So my first question is: what are these negative numbers?!

Why do I want to convert a word to UTF-8 or UTF-16?

I am working on a project that has two parts. The first part is an Android application that sends keywords to the server. The words are entered by my clients, who use Persian (فارسی) characters. The second part is a web application written in C#, which has to respond to my clients.

Problem: When I send these words to the server, it receives a stream of "????" instead of the correct word. I have tried many ways to solve this problem, but none of them worked. After that I decided to send the UTF-16 or UTF-8 bytes of the string to the server myself and convert them back to the correct word there. So I chose the method I described at the top of my post.

Is my original code reliable?

Yes, it is. If I use English characters, it responds correctly.

What is my original code?

Java code that sends the parameter to the server:

    protected String doInBackground(String... Urls) {
        String Data = "";
        HttpURLConnection urlConnection = null;
        try {
            // Tag is appended to the query string as-is
            URL myUrl = new URL("http://10.0.2.2:80/Urgence/SearchResault.aspx?Tag=" + Tag);
            urlConnection = (HttpURLConnection) myUrl.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
            String temp = "";
            // Data is used to store the server's response
            while ((temp = in.readLine()) != null) {
                Data = Data + temp;
            }
        } catch (IOException ex) {
            Log.d("Er>request-Problem", ex.toString());
        } finally {
            if (urlConnection != null) urlConnection.disconnect();
        }
        return Data;
    }

C# code that responds to the clients:

    string Tag = Request.QueryString["Tag"].ToString();
    SqlConnection con = new SqlConnection(WebConfigurationManager.ConnectionStrings["conStr"].ToString());
            SqlCommand cmd = new SqlCommand("FetchResaultByTag");
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@NewsTag",Tag);
            cmd.Connection = con;
            SqlDataReader DR;
            String Txt = "";
            try
            {
                con.Open();
                DR = cmd.ExecuteReader();
                while (DR.Read())
                {
                    Txt = Txt + DR.GetString(0) + "-" + DR.GetString(1) + "-" + DR.GetString(2) + "-" + DR.GetString(3) + "/";
                }
                //Response.Write(Txt);
                con.Close();
            }
            catch (Exception ex)
            {
                con.Close();
                Response.Write(ex.ToString());
            }

What do you think? Do you have any ideas?

Upvotes: 5

Views: 931

Answers (2)

curious dog

Reputation: 141

I solved it. First I changed my Java code: I converted my String to UTF-8 using the URLEncoder class.

New Java code:

    try {
        // Percent-encode the UTF-8 bytes of Tag so it is safe in a query string
        Tag = URLEncoder.encode(Tag, "UTF-8");
    } catch (Exception ex) {
        Log.d("Er>encoding-Problem", ex.toString());
    }
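For illustration, here is a minimal standalone sketch of what that encoding step produces for the word from the question. The class and method names are hypothetical, and Java's URLDecoder only stands in for the decoding that the server performs; it is not the C# code itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the original project.
class QueryEncodingDemo {
    // Percent-encode the UTF-8 bytes of the text.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    // Reverse the percent-encoding (stand-in for the server-side decode).
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String tag = "\u0633\u06CC\u0631"; // the word "سیر"
        String encoded = encode(tag);
        System.out.println(encoded);                     // prints %D8%B3%DB%8C%D8%B1
        System.out.println(decode(encoded).equals(tag)); // prints true
    }
}
```

The encoded value contains only ASCII characters, so it can be appended to the URL without being mangled on the way to the server.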

After that I sent it as a query string over HTTP:

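    protected String doInBackground(String... Urls) {
        String Data = "";
        HttpURLConnection urlConnection = null;
        try {
            // Tag has already been URL-encoded at this point
            URL myUrl = new URL("http://10.0.2.2:80/Urgence/SearchResault.aspx?Tag=" + Tag);
            urlConnection = (HttpURLConnection) myUrl.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
            String temp = "";
            // Data is used to store the server's response
            while ((temp = in.readLine()) != null) {
                Data = Data + temp;
            }
        } catch (IOException ex) {
            Log.d("Er>request-Problem", ex.toString());
        } finally {
            if (urlConnection != null) urlConnection.disconnect();
        }
        return Data;
    }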

Finally, I received it on the server and decoded it.

New C# code:

    string Tag = Request.QueryString["Tag"].ToString();
    SqlConnection con = new SqlConnection(WebConfigurationManager.ConnectionStrings["conStr"].ToString());
    SqlCommand cmd = new SqlCommand("FetchResaultByTag");
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.AddWithValue("@NewsTag", HttpUtility.UrlDecode(Tag));
    cmd.Connection = con;
    SqlDataReader DR;
    String Txt = "";
    try
    {
        con.Open();
        DR = cmd.ExecuteReader();
        while (DR.Read())
        {
            Txt = Txt + DR.GetString(0) + "-" + DR.GetString(1) + "-" + DR.GetString(2) + "-" + DR.GetString(3) + "/";
        }
        Response.Write(Txt);
        con.Close();
    }
    catch (Exception ex)
    {
        con.Close();
        Response.Write(ex.ToString());
    }

Upvotes: 3

Peter Duniho

Reputation: 70671

my first question is what are these negative numbers?!

They are the signed-byte representation of the individual bytes within each 16-bit value of your text. In Java, the byte type is a signed value, similar to int or long, but having only 8 bits of information. It can represent values anywhere from -128 to 127. They are only "negative" when interpreted as a Java byte value.
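As a minimal sketch of that point (the class and method names here are just illustrative), masking each byte with 0xFF gives the unsigned value of the same bit pattern:

```java
import java.util.Arrays;

// Hypothetical demo class for illustration only.
class SignedBytesDemo {
    // Reinterpret each signed Java byte as its unsigned value (0..255).
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF; // the mask keeps only the low 8 bits
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = {-1, -2, 51, 6, -52, 6, 49, 6}; // array from the question
        System.out.println(Arrays.toString(toUnsigned(bytes)));
        // prints [255, 254, 51, 6, 204, 6, 49, 6]
    }
}
```

The bits are unchanged; only the interpretation differs, so -1 becomes 255 (0xFF), -2 becomes 254 (0xFE), and -52 becomes 204 (0xCC).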

Of course, as bytes within UTF16-encoded text, that interpretation is meaningless. You are supposed to be interpreting them only as UTF16-encoded text. But the negative numbers are the natural result of misinterpreting UTF16-encoded text as if it were just a plain array of signed bytes.
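To illustrate, here is a small sketch (with a hypothetical class name) that hands the very same "negative" bytes back to Java's UTF-16 decoder, which is the interpretation they were meant for:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class Utf16RoundTrip {
    static String decodeUtf16(byte[] bytes) {
        // The "UTF-16" decoder reads the leading byte-order mark to pick the byte order.
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bytes = {-1, -2, 51, 6, -52, 6, 49, 6}; // array from the question
        String text = decodeUtf16(bytes);
        System.out.println(text.length());                       // prints 3
        System.out.println(text.equals("\u0633\u06CC\u0631"));   // prints true
    }
}
```

Decoded this way, the array is exactly the original three-character word, negative numbers and all.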

It's as if you'd written something like int i = -1; uint j = (uint)i; (in C#; Java does not have unsigned integer types per se) and then asked why j isn't negative and instead has the value 4,294,967,295. Well, j has an unsigned data type; the bit pattern used for -1 as a signed int is the same one used for 4,294,967,295 as a uint.
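A rough Java equivalent of that comparison, assuming Java 8 or later (where Integer.toUnsignedLong is available) and using a hypothetical class name:

```java
// Hypothetical demo class for illustration only.
class UnsignedViewDemo {
    // Read the 32 bits of an int as if they were an unsigned value.
    static long asUnsigned(int value) {
        return Integer.toUnsignedLong(value);
    }

    public static void main(String[] args) {
        int i = -1;
        System.out.println(i);                      // prints -1
        System.out.println(asUnsigned(i));          // prints 4294967295
        System.out.println(Integer.toHexString(i)); // prints ffffffff
    }
}
```

All three lines describe the same 32 bits; only the way they are read changes.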

If that previous paragraph doesn't make sense to you, then you will need to do some reading on your own to learn how computers store numbers and what the difference between signed and unsigned data types is.


The output array of your code, [-1,-2,51,6,-52,6,49,6], is actually four 16-bit values, in little-endian byte order: 0xFEFF, 0x0633, 0x06CC, and 0x0631. Each of those 16-bit values represents a Unicode code-point.

The first one is used as a byte-order-mark for UTF16-encoded text. It is a Unicode character that is specifically used to indicate whether the bytes in the UTF16 encoding are little-endian or big-endian. The other three are the characters from your actual string.
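The pairing of bytes into 16-bit values can be sketched like this (the class and method names are made up for the example, and it assumes the little-endian order seen in your output):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class for illustration only.
class CodeUnitsDemo {
    // Combine each pair of bytes into one unsigned 16-bit value, low byte first.
    static int[] toCodeUnits(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] units = new int[bytes.length / 2];
        for (int i = 0; i < units.length; i++) {
            units[i] = buf.getShort() & 0xFFFF; // mask to avoid sign extension
        }
        return units;
    }

    public static void main(String[] args) {
        byte[] bytes = {-1, -2, 51, 6, -52, 6, 49, 6}; // array from the question
        for (int unit : toCodeUnits(bytes)) {
            System.out.println(String.format("0x%04X", unit));
        }
        // prints 0xFEFF, 0x0633, 0x06CC, 0x0631 (one per line)
    }
}
```

The last three values match the "0633 06CC 0631" you got from the conversion website.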

But when you pull the bytes apart and look at them individually, if viewed as signed byte values, any value larger than 0x7F (when considered as unsigned byte values) represents a negative number as a signed byte value. So, the 0xFF, 0xFE, and 0xCC all are displayed as negative numbers (each of those being larger than 0x7F). But they really are still just each half of a valid 16-bit Unicode code point value.

Note that even those code point values could appear negative if interpreted incorrectly. In your example, just one would appear negative (0xFEFF is -257 when interpreted as a signed 16-bit value, even though the Unicode code point is actually decimal 65279), but there are plenty of other Unicode characters that have a value higher than 0x7FFF (decimal 32767) and would appear negative if viewed as a signed 16-bit value.
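The same effect at the 16-bit level, as a short sketch with a hypothetical class name (Java's short is a signed 16-bit type, so a cast is enough to show it):

```java
// Hypothetical demo class for illustration only.
class SignedShortDemo {
    // Reinterpret the low 16 bits of a value as a signed 16-bit number.
    static short asSigned16(int codeUnit) {
        return (short) codeUnit;
    }

    public static void main(String[] args) {
        int bom = 0xFEFF;
        System.out.println(bom);                // prints 65279
        System.out.println(asSigned16(bom));    // prints -257
        System.out.println(asSigned16(0x0633)); // prints 1587 (below 0x7FFF, so unchanged)
    }
}
```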

The bottom line is that computers don't really know anything about numbers. They just have bits, conveniently grouped into bytes, and various word sizes. When you want to know what those bits mean, you have to make sure you tell the computer the correct, useful representation to use when showing the bits to you. If you don't, then you get some other interpretation of those bits that doesn't match their intended representation. Garbage in, garbage out.


Now, assuming you understood all of the above, let's consider your broader question:

When I send these words to the server it works on a stream of "????" instead of the correct word. I have tried many ways to solve this problem but they couldn't solve it.

The first question to ask yourself is "how am I interpreting these bytes? how am I displaying them to the user?" You didn't share any of the code that was actually relevant in this respect, but you did say that when you use only the Latin alphabet ("English characters") it works fine. Assuming you tested the Latin alphabet scenario with UTF16 as well, then this tells me that the basic I/O is working correctly; the main thing you could get wrong is the byte order, but if that were happening, even the Latin characters would be garbled.

So the most likely explanation for the "????" you describe is that you are simply not viewing the text in a context where Persian characters can be displayed. For example, writing them out to a console window using the Console class. The font used in the console window doesn't support Unicode-aware rendering, so it's just not going to show Persian characters. There are similar issues in various other contexts, including e.g. Notepad (depending on what font is in use) and other editors.


So, sorry. All of the above is really just a lengthy way of saying to you "everything is fine, you're probably just not using the right tool to validate your results."

Note that without a good, minimal, complete code example that reliably reproduces whatever problem you perceive, it's not really possible to say for sure what's going on. If after reading and understanding this answer, you still believe there's something wrong with your code, you need to take the time to create a good code example that would clearly demonstrate the actual problem. A single line of code is worth a thousand words, and a proper code example is worth its weight in gold (to mix a couple of completely inapplicable metaphors :) ).

Upvotes: 1
