user2526586

Reputation: 1200

How to print and work with a UTF-8 string in Java?

I am working on a project that deals with foreign languages. I have this String in Java:

String string = "áçñéüöëéóíóíóíóíííããéíáéíáççãÓłńńāņšøøøøééèèÜüÜüééíéáéáříříççááññïïššäääééèèááéáéáéáéáéáéáèèèèííéèéèáééÇÇééééííüüüüííøøáááá¿¿ííóé̌Íá̌íáææööíÁíÁíííłççññá璇üşİüşİöğöğşşııããßßôèôèêééççáÁáÁééééééÇóóéíêööééííððññáñáñÓúÓúíłńłńååéééëëááéí¿¿ééÖÖáéáéöğÖüöğÖüçŞçŞııçııçııİİşİşíáíáéüüÉÉéééøññïíéé";

I have saved my Java source file with UTF-8 encoding.

I want to remove duplicate characters, sort the remaining characters by their Unicode code points, and finally print the resulting string and save it to a text file (in UTF-8 or another Unicode encoding).

I don't know if it is because of the terminal, but I am working in Eclipse (on Windows) and I see '?' (question mark) when printing some of the characters. What is the correct way to print the string?

I am also not sure how to SAFELY remove duplicate characters and sort the characters. For example, if I use String.charAt() and HashSet<Character>, is that safe in my case? Will I get half a character for some multi-byte characters? What is a safe way to compare these characters?

Knowing that the project may deal with a very large variety of languages, what is a safe way to save the string to a text file?


Update: To reproduce the question mark problem:

String str = "¿æŁéİüłïąņąø";
System.out.println(str);

It prints out this on my Eclipse console:

¿æ?é?ü?ï???ø

Note: I am already using GNU FreeMono as the console font, which has very good coverage of foreign characters.

Upvotes: 2

Views: 2087

Answers (2)

skomisa

Reputation: 17363

When calling System.out.println(str), the charset used by the underlying PrintStream (i.e. System.out) is your default charset, and if that is not UTF-8 then you might have problems when rendering in the Eclipse console. From the javadoc for PrintStream, with my emphasis added:

All characters printed by a PrintStream are converted into bytes using the given encoding or charset, or platform's default character encoding if not specified.

So your console output is probably not working because your "platform's default character encoding" is not UTF-8. There are two simple coding approaches to resolve that:

  • Call java.lang.System.setOut() so that System.out uses UTF-8.
  • Create your own PrintStream that uses UTF-8 instead of using System.out.

Here's code which reproduces your problem, and resolves it:

package pkg;

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Main {

    public static void main(String[] args) throws IOException {
        
        String str = "¿æŁéİüłïąņąø"; // Sample data from the question.
        
        System.out.println("1: " + str); // Fails if default charset is not UTF-8.  

        // Redirect System.out to use a PrintStream using UTF-8 charset.
        FileOutputStream fos2 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps2 = new PrintStream(fos2, true, StandardCharsets.UTF_8);
        System.setOut(ps2);
        System.out.println("2: " + str); // Works.
        
        // Use your own PrintStream with UTF-8 charset instead of using System.out.
        FileOutputStream fos3 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps3 = new PrintStream(fos3, true, StandardCharsets.UTF_8);
        ps3.print("3: " + str); // Works.
        ps3.close();
    }
}

This is a screenshot of the Eclipse console output from running that code, which demonstrates that both solutions described above work:

[Image: Eclipse console]

Notes:

  • My environment was Eclipse 2022-06 (4.24.0) with JDK 11.0.12 on Windows 10, with windows-1252 as the default charset and Consolas as the console font.
  • Presumably some (but not all) of the characters in your sample data rendered correctly because your "default charset" supported some (but not all) of those characters. None of the characters rendered correctly when using my default charset (windows-1252).
  • An alternative approach would be to change your platform's default encoding to UTF-8, so that System.out.println(str) would automatically encode using UTF-8, but that would mean your code is not portable.
  • The question Java JDK 18 in IntelliJ prints question mark "?" when I tried to print unicode like "\u1699" is relevant, though it focuses on println() issues with JDK 17/18 in IntelliJ IDEA.
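To see why only some characters turn into question marks, here is a minimal sketch that encodes the sample string with a non-UTF-8 charset and decodes it again. The class and method names are hypothetical, and windows-1252 is only an assumption about the default charset on the asker's machine. Characters that the charset cannot represent are replaced with '?' during encoding, which should reproduce the pattern shown in the question.

```java
import java.nio.charset.Charset;

final class UnmappableDemo {

    // Encodes the text with the given charset and decodes it again. This mimics
    // a PrintStream and a console that both use that charset: characters the
    // charset cannot represent are replaced with '?' while encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String str = "¿æŁéİüłïąņąø"; // Sample data from the question.

        // Shows which charset this JVM would use when none is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Assumption: windows-1252 is available in this JDK (it is in typical builds).
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println(roundTrip(str, cp1252));
    }
}
```

If the default charset printed on your machine is not UTF-8, that is a strong hint that the '?' characters come from encoding rather than from the console font.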

Upvotes: 4

access violation

Reputation: 552

Characters in a running Java program are intrinsically Unicode. They are in fact stored as UTF-16, a detail you can ignore until you need characters at U+10000 or above, which you probably don't at this point. If you do, look at the code point operations, such as String.codePoints().

A String is thus automatically a string of Unicode characters.
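For the deduplicate-and-sort part of the question, one possible sketch works on code points rather than on char values, so a character above U+FFFF is never split into half a surrogate pair. The class and method names below are hypothetical, not taken from the question.

```java
final class CodePointDedup {

    // Returns the distinct Unicode code points of the input in ascending
    // code point order, reassembled into a String.
    static String distinctSorted(String input) {
        return input.codePoints()   // IntStream of full code points, not UTF-16 units
                .distinct()         // drop repeated code points
                .sorted()           // ascending numeric (code point) order
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(distinctSorted("banana")); // prints "abn"
    }
}
```

One caveat: this treats a base letter followed by a combining mark as two separate code points. If that matters for your data, normalizing the input first with java.text.Normalizer (for example to NFC) may help, though not every combination has a precomposed form.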

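Java source code is compiled as UTF-8 by default from JDK 18 onward. Older versions of javac use the platform default encoding unless you pass the -encoding option, and IDEs such as Eclipse have their own source encoding setting. I can't say much more about that, since I'm a "UTF-8 only" person.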

So what this boils down to is that you don't have to do anything special in a Java program to "use Unicode" - it just is.

You may need to pay attention to cases where you read and write Strings to some external medium, like a disk file or a network connection. There is a conversion to a byte stream, which uses the default charset unless you say otherwise: UTF-8 from JDK 18 onward, and a platform-dependent charset (for example windows-1252 on many Windows installations) before that. You can explicitly specify the byte encoding in most contexts.
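As an illustration of specifying the encoding explicitly, here is a minimal sketch that saves a String to a text file as UTF-8 and reads it back. It assumes Java 11 or later for Files.writeString and Files.readString; the class name, method names, and temporary file are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileDemo {

    // Writes the text as UTF-8, regardless of the platform default charset.
    static void save(Path file, String text) throws IOException {
        Files.writeString(file, text, StandardCharsets.UTF_8);
    }

    // Reads the whole file back, decoding it as UTF-8.
    static String load(Path file) throws IOException {
        return Files.readString(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-demo", ".txt"); // hypothetical location
        String str = "¿æŁéİüłïąņąø"; // Sample data from the question.
        save(file, str);
        System.out.println("Round trip intact: " + load(file).equals(str));
        Files.delete(file);
    }
}
```

Because the charset is passed on both the write and the read, the result does not depend on the machine's default encoding.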

Your remaining problem seems to be related to display on Windows. That appears to be a font issue; you need a font containing the characters. Or, since it's Windows, it may be a matter of selecting the right "code page".

Upvotes: 2
