kry

Reputation: 362

String comparison in UTF-8

I have a PHP script that is supposed to return a UTF-8 encoded string. However, in Java I can't seem to compare it with Java's internal string in any way.

If I print "OK" and response, they appear the same in the console. However, if I check equality

if ( "OK".equals(response) ) {

the result is false. I printed both out in binary: response is 11101111 10111011 10111111 01001111 01001011, whereas Java's String "OK" is 01001111 01001011, which is clearly ASCII. I tried to convert it to UTF-8 in a few ways, but to no avail:

String result2 = new String("OK".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

and

String result2 = new String("OK".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);

Neither works; both still return the ASCII codes for some reason.

byte[] result2 = "OK".getBytes(StandardCharsets.UTF_8);
System.out.print(new String(result2));

While this also prints the correct "OK", in binary it is still ASCII.

I've tried changing the communication to numbers instead, but 1 still does not equal 1: Integer.parseInt(response) fails with a '"1" is not a String' error message, although in every other respect response is recognised as a normal String.

I'm looking for a solution, preferably one where "OK" is converted to UTF-8 rather than response to ASCII, since I need to communicate with a PHP script and 2 databases, all set to UTF-8. Java is started with the switch -Dfile.encoding=UTF8 to ensure national characters are not broken.

Upvotes: 0

Views: 112

Answers (1)

AterLux

Reputation: 4654

In UTF-8, all characters with codes 127 or less are encoded as a single byte. Therefore "OK" is the same two bytes in UTF-8 and in ASCII.
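To make this concrete, here is a minimal sketch that encodes a string in both charsets and compares the resulting bytes. The class name AsciiUtf8Demo and the helper sameBytes are made up for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the question's code.
class AsciiUtf8Demo {
    // Returns true when the UTF-8 and US-ASCII encodings of s are byte-for-byte identical.
    static boolean sameBytes(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        // 'O' and 'K' are both below 128, so the two encodings agree.
        System.out.println(sameBytes("OK"));
        // U+00E9 is above 127: UTF-8 needs two bytes and ASCII cannot represent it.
        System.out.println(sameBytes("\u00E9"));
    }
}
```

This is also why re-encoding "OK" as UTF-8, as attempted in the question, cannot change its bytes.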

11101111 10111011 10111111 01001111 01001011 is not just plain "OK"; it is

0xEF, 0xBB, 0xBF, "OK"

where 0xEF, 0xBB, 0xBF is a BOM (byte order mark), the UTF-8 encoding of the character U+FEFF.

It is a character that editors do not display but that is used to signal the encoding.
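The following sketch decodes the exact five bytes from the question, to show what Java ends up with. It assumes the response is decoded as UTF-8 through the String constructor; the class name BomDecodeDemo and the method decode are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the byte values are the ones printed in the question.
class BomDecodeDemo {
    static final byte[] RESPONSE_BYTES = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, // UTF-8 BOM
        0x4F, 0x4B                             // 'O', 'K'
    };

    // Decodes the raw bytes as UTF-8; Java's decoder keeps the BOM as the character U+FEFF.
    static String decode() {
        return new String(RESPONSE_BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String response = decode();
        System.out.println(response.length());             // three chars, not two
        System.out.println(response.charAt(0) == '\uFEFF'); // the invisible first char is the BOM
        System.out.println("OK".equals(response));          // so the comparison fails
    }
}
```

The decoded string looks like "OK" when printed, but it carries one extra invisible character at the front, which is enough to make equals return false.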

Those bytes probably ended up in your PHP script before <?php.

You have to configure your editor to save the file without a BOM.

UPD

If it is not possible to alter the PHP script, you can use a workaround:

  // check if the first symbol of the response is BOM
  if (!response.isEmpty() && (response.charAt(0) == 0xFEFF)) {
    // removing the first symbol
    response = response.substring(1);
  }
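The same check can be wrapped in a small helper so that every response passes through it before being compared or parsed. This is only a sketch: BomStripper and stripBom are made-up names, and the null handling is an added assumption.

```java
// Hypothetical helper class wrapping the workaround above.
class BomStripper {
    // Returns s without a leading U+FEFF; null and BOM-free strings are returned unchanged.
    static String stripBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        String response = "\uFEFFOK"; // simulated response with a leading BOM
        System.out.println("OK".equals(stripBom(response)));

        // The numeric variant from the question is affected the same way:
        // Integer.parseInt rejects the BOM character until it is stripped.
        System.out.println(Integer.parseInt(stripBom("\uFEFF1")));
    }
}
```

Stripping only a single leading character keeps the helper harmless for responses that never had a BOM.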

Upvotes: 4
