marco_sap

Reputation: 1869

Java servlet doesn't handle special characters correctly (like ć)

I have a Java servlet that reads a parameter sent by a JavaScript front end. The front end uses:

escape("{€ć") which becomes "%7B%u20AC%u0107"

Well the Java servlet does this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches %uXXXX (a UTF-16 code unit) or %XX (a single byte value), as produced by JavaScript's escape().
private static final Pattern JAVASCRIPT_ESCAPE_SEQUENCE = Pattern.compile("%(u[0-9a-fA-F]{4}|[0-9a-fA-F]{2})");

static String unescape(String input) {
    Matcher matcher = JAVASCRIPT_ESCAPE_SEQUENCE.matcher(input);
    StringBuffer sb = new StringBuffer(input.length());
    while (matcher.find()) {
        String escapeSequence = matcher.group(1);
        if (escapeSequence.startsWith("u")) {
            escapeSequence = escapeSequence.substring(1);
        }
        char c = (char) Integer.parseInt(escapeSequence, 16);
        // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference.
        matcher.appendReplacement(sb, Matcher.quoteReplacement(Character.toString(c)));
    }
    matcher.appendTail(sb);
    return sb.toString();
}

String sDecodedContent = unescape(requestContent);

In Java the variable sDecodedContent is not "{€ć" but "{€?", and it sends this string to the backend, which stores the incorrect string in the DB. Why is ć not being decoded correctly? Regards

Upvotes: 0

Views: 694

Answers (1)

rzwitserloot

Reputation: 103823

In Java the variable sDecodedContent is not "{€ć" but "{€?"

This is incorrect.

Given the JAVASCRIPT_ESCAPE_SEQUENCE pattern in your paste, c will end up having the value 0x0107 for the %u0107 sequence.

So let's go with that:

char c = 0x0107;
System.out.println(Character.toString(c));

This prints ć as expected, and inspecting the string confirms, unsurprisingly, that the character with code point 0x0107 is indeed in it. Java is not randomly broken or idiotically designed, so that makes sense.
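One way to do that inspection without trusting the terminal at all is to print the code points as plain ASCII hex. This is only a minimal sketch: the class name CodePointDump and the helper describe are made up for illustration and are not part of the question's code.

```java
import java.util.stream.Collectors;

// Hypothetical helper class, named here only for illustration.
class CodePointDump {
    // Renders every code point as U+XXXX using ASCII only, so the result
    // reads the same no matter how the terminal or log decodes bytes.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Same conversion the unescape method performs for "%u0107".
        char c = (char) Integer.parseInt("0107", 16);
        System.out.println(describe(Character.toString(c))); // U+0107
        System.out.println(describe("{\u20AC\u0107"));       // U+007B U+20AC U+0107
    }
}
```

If describe(sDecodedContent) shows U+0107 in the last position, the decoding step is fine and the ? is introduced somewhere after it.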

So why are you observing something else?

Because whatever System.out is sending its output to is just a stream: a sack of bytes. Charset conversion is happening all over the place. Java thinks the charset it needs to encode that character with, in order to even get it to sysout, is A. Those bytes are then decoded back into characters and shown to your eyeballs by some other process, which thinks the charset is B, and A and B are not compatible. It is also possible that A simply cannot represent ć, in which case Java itself writes a plain ? in its place.

Alternatively, the charsets are compatible, but the font used to render the text cannot handle 0x0107, and the glyph used to indicate 'I do not have a glyph for this' is a ?. If it is not a question mark in a black diamond shape, it's likely you have an extremely simplistic font set up, or, far more likely, that encoding issue.

So, are you running this in a terminal? You've misconfigured it. Check the documentation of bash or iTerm or whatever you are using to see how to configure its encoding properly. Java is sending the right stuff; what happens after that is at fault.
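The effect of a lossy charset somewhere in the chain can be reproduced in isolation. The sketch below pushes the question's string through an encode/decode round trip. The class and method names are hypothetical, and windows-1252 is picked purely as an example of a charset that has a euro sign but no ć; nothing in the question confirms that this particular charset is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the question.
class EncodingRoundTrip {
    // Encodes the string to bytes and decodes those bytes again, which is
    // roughly what happens whenever text passes through a byte stream.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "{\u20AC\u0107"; // the three characters { € ć
        // Assumption: this charset is available in the running JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        // UTF-8 can represent all three characters, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true

        // windows-1252 has no ć, so getBytes substitutes '?' for it.
        System.out.println(roundTrip(text, cp1252).equals("{\u20AC?")); // true
    }
}
```

The second result has the same shape as the "{€?" from the question, which at least hints that a Western European single-byte charset is in play somewhere between the servlet and the place where the string is displayed or stored.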

and it sends this string to the backend, which stores the incorrect string in the DB.

Again, Java is not at fault, which means your DB is at fault, or possibly the JDBC driver. For example, on MySQL, perhaps you have used the character set utf8, which is not full UTF-8: it stores at most three bytes per character. (MySQL is quite a bad DB with a ton of bizarre caveats that make no sense and that you need to know about in order to use it properly. I strongly suggest you use a database with far fewer warts like this.) Or you've just left it at the default, which in older versions is latin1 with a Swedish collation, instead of utf8mb4 (which is MySQLese for actual UTF-8). The settings to look for are called 'character set' and 'collation', if that helps when perusing the documentation of the DB you use.

A trivial way to test all this stuff is to go straight to the source:

String test = "\u0107";
System.out.println(test);
sendToDb(test);

If ć is not being printed or something else arrives at the DB, you know it is not Java, because "\u0107" is a literal representing ć which cannot possibly be misinterpreted and does not depend on the charset configuration of, well, anything. That's what \u escapes in Java source files are for: to ensure that an erroneous charset passed to e.g. the javac command via -encoding does not affect the result at all.

You'll find it's printing ? and the DB is similarly mangling these strings. Mess with configs of your terminal and/or database until this works.
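While adjusting those settings, it can help to see which charsets the JVM itself believes it should use. The following is a rough diagnostic sketch with a hypothetical class name. Which system properties are set depends on the Java version and platform (stdout.encoding only exists on newer releases, and sun.stdout.encoding is an internal property), so a null value for some of them is normal.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, named for illustration only.
class CharsetDiagnostics {
    // Collects the JVM's view of its charset configuration as plain text.
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("default charset = ").append(Charset.defaultCharset()).append('\n');
        // Any of these properties may be unset (null) on a given JVM.
        String[] keys = {"file.encoding", "stdout.encoding", "sun.stdout.encoding"};
        for (String key : keys) {
            sb.append(key).append(" = ").append(System.getProperty(key)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

If the values reported here do not match what the terminal is configured to decode, that mismatch is a likely place for the ? to come from.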

Upvotes: 2
