Reputation: 33

Reading unicode character in java

I'm a bit new to Java. When I assign a Unicode string to a variable and print it, the actual characters are printed:

  String str = "\u0142o\u017Cy\u0142";
  System.out.println(str);

  final StringBuilder stringBuilder = new StringBuilder();
  InputStream inStream = new FileInputStream("C:/a.txt");
  final InputStreamReader streamReader = new InputStreamReader(inStream, "UTF-8");
  final BufferedReader bufferedReader = new BufferedReader(streamReader);
  String line = "";
  while ((line = bufferedReader.readLine()) != null) {
      System.out.println(line);
      stringBuilder.append(line);
  }

Why are the results different in the two cases? The file a.txt contains the same string, but when I print the contents of the file I get z\u0142o\u017Cy\u0142 instead of the actual Unicode characters. How can I make the file content print the same way the string literal does?

Upvotes: 3

Answers (7)

tchrist

Reputation: 80443

I posted Java code to unescape (“descape”?) such things and many others in this answer.

Upvotes: 0

InsertNickHere

Reputation: 3666

I think it's just "UTF8", not "UTF-8".

Here I saw it: Source

Upvotes: 0

Alex

Reputation: 19

You have used FileInputStream, which is a byte reader, not a character reader. Try using FileReader instead,

something like:

BufferedReader inputStream = new BufferedReader(new FileReader("C:/a.txt"));

Then you can use the line-oriented I/O of BufferedReader to read each line. FileInputStream is low-level I/O that you should avoid here. You're writing characters to your file, not bytes, so the best approach is to use character streams for writing and reading unless you need to write bytes/binary data.

Upvotes: -1

BalusC

Reputation: 1109645

So, you want to unescape Unicode codepoints? There is no public API available for this. java.util.Properties has a loadConvert() method which does exactly this, but it's private. Check the Java source if you'd like to reuse it. It does the conversion by simple parsing. I wouldn't use regex for this, since that is too error-prone in very specific circumstances.

Alternatively, you should probably be using java.util.Properties, or its i18n counterpart java.util.ResourceBundle, with a .properties file instead of a plain .txt file.
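If the file has to stay a plain .txt, the same decoding can still be reached through the public Properties.load(Reader) method. The following is only a rough sketch: the class name PropertiesUnescaper and the dummy key "value" are made up for illustration, and each line is assumed to be a single logical value. Keep in mind that Properties applies all of its own parsing rules too (other backslash escapes, a trailing backslash as a line continuation, leading whitespace stripped from the value), so this only fits input that really follows .properties conventions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of the JDK: borrows the escape handling of Properties.
final class PropertiesUnescaper {

    // Feeds the line to Properties as the value of a dummy key and reads the decoded value back.
    static String unescape(String line) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("value=" + line));
        } catch (IOException e) {
            // A StringReader should not fail here; rethrow unchecked to keep the call site simple.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("value");
    }

    public static void main(String[] args) {
        // Same text as in a.txt: the escapes are literal backslash sequences, not characters.
        String fromFile = "z\\u0142o\\u017Cy\\u0142";
        System.out.println(unescape(fromFile));
    }
}
```

In the question's loop this would be called as PropertiesUnescaper.unescape(line) before printing.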

There's probably a library for decoding these, but you could do it yourself. According to the Java Language Specification, an escape sequence has the form \uxxxx, so you could take the 4-digit hex value xxxx, convert it to an integer with Integer.parseInt (radix 16), cast that to a char, and finally replace the whole \uxxxx sequence with the character.
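The steps above could look roughly like this. It is a minimal sketch using plain parsing rather than regex; the class and method names (UnicodeUnescaper, unescape, isEscapeAt) are invented for the example. It assumes every escape is exactly a backslash, one 'u' and four hex digits, and it leaves anything else untouched. It does not try to handle escaped backslashes or the repeated-u form that the Java compiler also accepts.

```java
// Hypothetical helper, not a JDK class: decodes backslash-u escapes by simple parsing.
final class UnicodeUnescaper {

    // Replaces each escape (backslash, 'u', four hex digits) with the character it names.
    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            if (isEscapeAt(input, i)) {
                // Parse the four hex digits after the backslash and 'u' as a UTF-16 code unit.
                int codeUnit = Integer.parseInt(input.substring(i + 2, i + 6), 16);
                out.append((char) codeUnit);
                i += 6;
            } else {
                // Not a complete escape: copy the character unchanged.
                out.append(input.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if a complete six-character escape starts at index i.
    private static boolean isEscapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') {
            return false;
        }
        for (int k = i + 2; k < i + 6; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same text as in a.txt: the escapes are literal backslash sequences, not characters.
        String fromFile = "z\\u0142o\\u017Cy\\u0142";
        System.out.println(unescape(fromFile));
    }
}
```

In the question's loop this would be called as UnicodeUnescaper.unescape(line) before printing. Whether the console then shows the characters correctly still depends on the console's own encoding.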

Upvotes: 2
