
How to use utf8 with javascript, javaservlets, mysql and back?

I am trying to handle Hebrew characters in my app. My app is built as follows:

A ui with java servlets, jsp.
A server with java servlets, mysql.

My app gets data through the UI, builds a JavaScript object, turns it into a JSON string with JSON.stringify, and sends it with XMLHttpRequest using xhr.send("data=".concat(jsonString)); The UI servlet receives the jsonString and forwards it to the server's servlet, which saves it to the db with the Hibernate API.

I have been stuck on this Hebrew issue for a while. After researching on the web, this is what I have so far:

My JSP files start with

<%@page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>

and have

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

inside the <head> tag.

Inside the JavaScript constructors I use encodeURIComponent() on the fields that might have Hebrew characters.
On both the UI servlets and the server servlets I have filters that set the character encoding to utf-8 if it is null.
I call the constructor for the db object (I'm using Hibernate) with new String(originalString.getBytes() , "UTF8"), where originalString is any string that might have Hebrew characters.
In my persistence.xml file I have

<property name="hibernate.connection.CharSet" value="utf8mb4" />
<property name="hibernate.connection.characterEncoding" value="utf8" />
<property name="hibernate.connection.useUnicode" value="true" />

all set.

In Eclipse I've set project->properties->resource->Text file encoding to UTF8.
I've tried using xhr.overrideMimeType("UTF-8") and xhr.setRequestHeader("charset" , "utf-8") but they didn't help so I commented them out.

I think that's it. I actually have a feeling I've made a bit of a mess....

Now, when I try to save Hebrew characters to the db through the UI:

When I call System.out.println in the UI servlets I get this kind of stuff: "×××¢" instead of Hebrew chars. The same happens when I try to show the Hebrew chars back on the UI.
When I call System.out.println in the server servlets I get this kind of stuff: "Ã\u0097Â\u0092Ã\u0097Â\u0096Ã\u0097Â¢"
In MySQL Workbench I see A's with accent marks on top of them, along with small squares with 4 digits inside them.

I would very much like to be able to view Hebrew chars in both MySQL Workbench and my UI.

Thank you!

------------------EDIT---------------------

I've added to my servlets

request.setCharacterEncoding("UTF-8");

and now I get Hebrew chars in my UI servlets.

The UI servlets forward the request to the server servlets with the code below, which I've been trying to debug for the last few hours with no success. I think the problem might be here:

public static String forwardToServer(String servletName , 
                                         Map<String, Object> params , 
                                         String encoding , String method , 
                                         HttpSession session) {
        try {
            URL url = new URL(settings.LocationSettings.SERVER_ADDRESS.concat(servletName));
            StringBuilder postData = new StringBuilder();
            for (Map.Entry<String,Object> param : params.entrySet()) {
                if (postData.length() != 0) postData.append('&');
                /*postData.append(URLEncoder.encode(param.getKey(), encoding));
                postData.append('=');
                postData.append(URLEncoder.encode(String.valueOf(param.getValue()), encoding));
               */
                postData.append(param.getKey());
                postData.append('=');
                postData.append(String.valueOf(param.getValue()));
            }
            System.out.println("postData = " + postData.toString());
            byte[] postDataBytes = postData.toString().getBytes(encoding);
            System.out.println("postDataBytes.toString() = " + new String(postDataBytes));
            byte[] postDataBytes2 = postData.toString().getBytes();
            System.out.println("postDataBytes2.toString() = " + new String(postDataBytes2));




            HttpURLConnection conn = (HttpURLConnection)url.openConnection();

            String mySessionCookie = "JSESSIONID="+session.getAttribute(Login.SERVER_SESSION_ID_ATT_NAME);
            conn.setRequestMethod(method);
            conn.setRequestProperty("Cookie", mySessionCookie);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            conn.setRequestProperty("Content-Length", String.valueOf(postDataBytes.length));
            conn.setRequestProperty("charset" , "utf-8");
            conn.setDoOutput(true);

            if (postDataBytes != null && postDataBytes.length > 0) {
                BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(conn.getOutputStream(), "UTF-8"));
                bw.write(postData.toString());
                bw.flush();
                bw.close();

                //conn.getOutputStream().write(postDataBytes);
            }



            Reader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), encoding));
            StringBuilder sb = new StringBuilder("");
            for (int c; (c = in.read()) >= 0;) {
                sb.append((char)c);
            }
            return sb.toString();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (ProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } 
        return null;
    }

The first commented-out part ( /*postData.append ..... encoding));*/ ) was part of my debugging. System.out.println("postData = " + postData.toString()); shows the exact same thing with and without it (the Hebrew chars are shown correctly).

The two System.out.println("postDataBytes.... calls also show the same thing (Hebrew chars shown correctly).

The commented-out //conn.getOutputStream().write(postDataBytes); was my previous version (up until a few hours ago); while debugging I changed it to the BufferedWriter code that is there now.
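For reference, the commented-out URLEncoder lines are the usual way to build an application/x-www-form-urlencoded body: once every key and value is percent-encoded as UTF-8, the body is plain ASCII, so the bytes on the wire no longer depend on which charset the writer uses. The sketch below shows only that step. FormBodySketch, encodeForm and toBytes are hypothetical names made up for illustration, and this is not claimed to be the cause of the problem described here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original code.
class FormBodySketch {

    // Percent-encodes every key and value as UTF-8 and joins them with '&'.
    static String encodeForm(Map<String, Object> params) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, Object> e : params.entrySet()) {
                if (body.length() != 0) body.append('&');
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"));
                body.append('=');
                body.append(URLEncoder.encode(String.valueOf(e.getValue()), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is a charset every JVM must support.
            throw new IllegalStateException(ex);
        }
        return body.toString();
    }

    // The encoded body is ASCII only; the length of this array is the Content-Length.
    static byte[] toBytes(String body) {
        return body.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("race", "\u05DC\u05D5\u05DC"); // the Hebrew sample from the question
        params.put("id", 7);
        System.out.println(encodeForm(params));
    }
}
```

The receiving servlet still has to decode the %XX sequences as UTF-8, so request.setCharacterEncoding("UTF-8") on the server side remains necessary.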

Now what appears in the ui servlets as

"race":"לול","flockId":"לול"

appears in the server as:

"race":"×\u009c×\u0095×\u009c","flockId":"×\u009c×\u0095×\u009c"

(when calling System.out.println)

And now I'm stuck again.....

----------------------EDIT2--------------------------

To understand exactly where the problem is, I sent the HTTP POST request directly to the server's servlet. When doing that, I still get this:

"race":"×\u009c×\u0095×\u009c","flockId":"×\u009c×\u0095×\u009c"

which means the problem is in the server's servlet. I just can't find what exactly the problem is.
As I wrote before, I call request.setCharacterEncoding("UTF-8"); in doPost(HttpServletRequest request, HttpServletResponse response).

Any ideas?

Upvotes: 0

Answers (2)

Ma'or

Reputation: 95

So.....

Problem solved!!!!

I am not sure what the problem was, but what solved it was switching the order of request.setCharacterEncoding("UTF-8"); and request.getParameterMap();

instead of:

Map<String, String[]> map = request.getParameterMap();
request.setCharacterEncoding("UTF-8");

I now have:

request.setCharacterEncoding("UTF-8");          
Map<String, String[]> map = request.getParameterMap();

and that solved the problem...

I don't really know how to explain it. If anyone does, I'll be happy to know.

Also, encodeURIComponent in the JavaScript constructors was unnecessary.
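A likely explanation for why the order matters: the Servlet API documentation says setCharacterEncoding must be called before reading request parameters, otherwise it has no effect, and a container that has not been told a request encoding decodes the body as ISO-8859-1. getParameterMap() triggers the parsing, so calling it first fixes the decoded strings before UTF-8 is ever requested. The sketch below imitates the two decodings with java.net.URLDecoder; it is an illustration only, not the container's actual code, and ParameterDecodingSketch and decode are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical illustration of decoding one form value with two charsets.
class ParameterDecodingSketch {

    // Decodes a percent-encoded form value using the named charset.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of the Hebrew sample, as they travel in a form body.
        String raw = "%D7%9C%D7%95%D7%9C";

        String asUtf8 = decode(raw, "UTF-8");        // the three Hebrew letters
        String asLatin1 = decode(raw, "ISO-8859-1"); // one char per byte: mojibake

        System.out.println(asUtf8.length());   // 3 characters
        System.out.println(asLatin1.length()); // 6 characters
    }
}
```

Decoded as ISO-8859-1, every byte becomes its own character, which gives the × followed by \u009c and \u0095 seen in the server's output.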

Upvotes: 0

Rick James

Reputation: 142298

Something is converting to "Unicode", not "UTF-8". I see this from \u0097 (etc). But, worse than that, \u0097 is a C1 control code, not a printable character, so it should not appear in text at all.

Â¢ is Mojibake for ¢

Please provide sample Hebrew and the corresponding gibberish. It seems that there are two things conspiring to mess up your text; it is hard enough to reverse-engineer when only one conversion has been done.
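One chain of two conversions that reproduces the server-side gibberish in the question is reading UTF-8 bytes as ISO-8859-1 twice in a row. The sketch below is a guess worked backward from the output: the input word (the Hebrew letters U+05D2 U+05D6 U+05E2) is an assumption, since the question does not say what was typed, and DoubleMojibakeSketch and misreadOnce are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of double mojibake.
class DoubleMojibakeSketch {

    // Encodes the text as UTF-8, then wrongly decodes those bytes as ISO-8859-1.
    static String misreadOnce(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String hebrew = "\u05D2\u05D6\u05E2"; // assumed input, UTF-8 bytes D7 92 D7 96 D7 A2

        String once = misreadOnce(hebrew); // × U+0092 × U+0096 × ¢
        String twice = misreadOnce(once);  // Ã U+0097 Â U+0092 Ã U+0097 Â U+0096 Ã U+0097 Â ¢

        System.out.println(once.length());  // 6 characters
        System.out.println(twice.length()); // 12 characters
    }
}
```

If that guess is right, the "×××¢" seen in the UI servlets is the single misread with the invisible control characters dropped, and the longer string in the server servlets is the same text misread a second time.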

Another thing to help debug the situation is to SELECT HEX(col) ... to see what was stored.

This Q&A may help fix it. If not, provide more info.

More

(I am using MySQL's character sets to perform this research. This may (or may not) match the encodings used in the document in question.)

לול in utf8 encoding is D79CD795D79C; if Mojibaked it becomes ×œ×•×œ. So, I can see the × and the 9C and 95. But how some bytes get carried through while others are converted to unicode (\u...) is a mystery.
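One possible reading of the mix of × and \u... escapes: Java's ISO-8859-1 maps bytes 0x80 to 0x9F onto the invisible C1 control characters U+0080 to U+009F, whereas windows-1252 maps most of them onto printable characters such as œ and •. The sketch below decodes the same UTF-8 bytes both ways. It assumes the JVM ships the windows-1252 charset (common, but not guaranteed by the Java specification), and MojibakeSketch and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: same bytes, two wrong decoders.
class MojibakeSketch {

    // Upper-case hex dump, comparable to MySQL's SELECT HEX(col).
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X", b & 0xFF));
        return sb.toString();
    }

    // Encodes as UTF-8, then decodes with the given (wrong) charset.
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    // Renders C1 control characters (U+0080..U+009F) as \\uXXXX so they are visible.
    static String escapeControls(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x80 && c <= 0x9F) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hebrew = "\u05DC\u05D5\u05DC"; // the sample from the question
        System.out.println(hex(hebrew.getBytes(StandardCharsets.UTF_8)));
        System.out.println(misdecode(hebrew, Charset.forName("windows-1252")));
        System.out.println(escapeControls(misdecode(hebrew, StandardCharsets.ISO_8859_1)));
    }
}
```

Under that reading, every byte was converted the same way; the \u escapes are just how the unprintable characters were displayed.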

If you are using any conversion functions, remove them.
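As an example of why such conversions are risky, consider the new String(originalString.getBytes(), "UTF8") call mentioned in the question. getBytes() without an argument uses the platform default charset, so the round trip either does nothing (default is UTF-8) or destroys the text (default cannot represent Hebrew). The sketch below passes the "default" in explicitly to make both cases reproducible; RoundTripSketch and roundTrip are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the getBytes()/new String round trip.
class RoundTripSketch {

    // Mimics new String(s.getBytes(), "UTF8") for a given platform default charset.
    static String roundTrip(String s, Charset platformDefault) {
        return new String(s.getBytes(platformDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String hebrew = "\u05DC\u05D5\u05DC";

        // Default is UTF-8: the round trip changes nothing.
        System.out.println(roundTrip(hebrew, StandardCharsets.UTF_8).equals(hebrew));

        // Default is ISO-8859-1: Hebrew is unmappable, so each letter becomes '?'.
        System.out.println(roundTrip(hebrew, StandardCharsets.ISO_8859_1));
    }
}
```

A Java String is already Unicode, so in neither case does the round trip improve anything.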

cp1250, cp1256, cp1257, latin1, latin2, latin5, latin7 treat hex D7 as '×'.
hebrew treats hex AA as ×.
The utf8 encoding for × is hex C397.

cp1250, cp1251, cp1256, cp1257, dec8, geostd8, greek, hebrew, latin1, latin5, latin7 treat hex BB as ».
latin2 treats hex BB as ť.

\u0095 is "message waiting". In general, \u009x should not show up in text.

The clues don't match up, so I continue to be stumped as to how "you got from here to there".

Upvotes: 1
