How to print and work with UTF-8 strings in Java?

Question

I am working on a project that deals with foreign languages. I have this String in Java:

String string = "áçñéüöëéóíóíóíóíííããéíáéíáççãÓłńńāņšøøøøééèèÜüÜüééíéáéáříříççááññïïššäääééèèááéáéáéáéáéáéáèèèèííéèéèáééÇÇééééííüüüüííøøáááá¿¿ííóé̌Íá̌íáææööíÁíÁíííłççññá璇üşİüşİöğöğşşııããßßôèôèêééççáÁáÁééééééÇóóéíêööééííððññáñáñÓúÓúíłńłńååéééëëááéí¿¿ééÖÖáéáéöğÖüöğÖüçŞçŞııçııçııİİşİşíáíáéüüÉÉéééøññïíéé";

I have saved my Java source file in UTF-8 encoding.

I want to remove duplicate characters, sort the remaining characters by their Unicode code points, and finally print the resulting string and save it to a text file (in UTF-8 or another Unicode encoding).

I don't know if it is because of the terminal, but I am working in Eclipse (on Windows) and I see '?' (question mark) when printing some of the characters. What is the correct way to print the string?

I am also not sure how to SAFELY remove duplicate characters and sort the characters. For example, if I use String.charAt() and a HashSet, is that safe in my case? Will I get half a character for some multi-byte characters? What is a safe way to compare these characters?

Knowing that the project may deal with a very large variety of different languages, what is a safe way to save the string to a text file?

Update: To reproduce the question mark problem:

String str = "¿æŁéİüłïąņąø";
System.out.println(str);

It prints this on my Eclipse console:

¿æ?é?ü?ï???ø

Note: I am already using GNU FreeMono for the console font, which has very good coverage of foreign characters.

skomisa · Accepted Answer

When calling System.out.println(str), the charset used by the underlying PrintStream (i.e. System.out) is your default charset, and if that is not UTF-8 then you might have problems when rendering in the Eclipse console. From the javadoc for PrintStream, with my emphasis added:

All characters printed by a PrintStream are converted into bytes using the given encoding or charset, or platform's default character encoding if not specified.
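As a minimal sketch of what that conversion does, the following encodes the sample string from the question and decodes it again with the same charset. ISO-8859-1 is only an assumed stand-in for a non-UTF-8 platform default, and the class and method names are hypothetical. Characters the charset cannot represent are replaced during encoding, which for this charset yields the same '?' pattern shown in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {

    // Encodes the text with the given charset and decodes it again, which
    // approximates what a PrintStream and a console using that charset show.
    static String roundTrip(String text, Charset charset) {
        // getBytes() replaces unmappable characters with the charset's replacement bytes.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The question's sample string, written with Unicode escapes so that
        // this example does not depend on the source file encoding.
        String str = "\u00BF\u00E6\u0141\u00E9\u0130\u00FC\u0142\u00EF\u0105\u0146\u0105\u00F8";

        // Shows which charset this JVM treats as the platform default.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // ISO-8859-1 (assumed stand-in) cannot represent six of the twelve characters.
        System.out.println("ISO-8859-1 preserves text: "
                + roundTrip(str, StandardCharsets.ISO_8859_1).equals(str)); // false

        // UTF-8 can represent every Unicode character.
        System.out.println("UTF-8 preserves text: "
                + roundTrip(str, StandardCharsets.UTF_8).equals(str)); // true
    }
}
```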

So your console output is probably not working because your "platform's default character encoding" is not UTF-8. There are two simple coding approaches to resolve that:

Call java.lang.System.setOut() so that System.out uses UTF-8.
Create your own PrintStream that uses UTF-8 instead of using System.out.

Here's code which reproduces your problem, and resolves it:

package pkg;

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Main {

    public static void main(String[] args) throws IOException {
        
        String str = "¿æŁéİüłïąņąø"; // Sample data from the question.
        
        System.out.println("1: " + str); // Fails if default charset is not UTF-8.  

        // Redirect System.out to use a PrintStream using UTF-8 charset.
        FileOutputStream fos2 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps2 = new PrintStream(fos2, true, StandardCharsets.UTF_8);
        System.setOut(ps2);
        System.out.println("2: " + str); // Works.
        
        // Use your own PrintStream with UTF-8 charset instead of using System.out.
        FileOutputStream fos3 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps3 = new PrintStream(fos3, true, StandardCharsets.UTF_8);
        ps3.print("3: " + str); // Works.
        ps3.close(); // Also closes the standard output descriptor, so keep this last.
    }
}

The Eclipse console output from running that code demonstrated that both solutions described above work.

Notes:

My environment was Eclipse 2022-06 (4.24.0) with JDK 11.0.12 on Windows 10, with windows-1252 as the default charset and Consolas as the console font.
Presumably some (but not all) of the characters in your sample data rendered correctly because your "default charset" supported some (but not all) of those characters. None of the characters rendered correctly when using my default charset (windows-1252).
An alternative approach would be to change your platform's default encoding to UTF-8, so that System.out.println(str) would automatically encode using UTF-8, but that would mean your code is not portable.
The question titled 'Java JDK 18 in IntelliJ prints question mark "?" when I tried to print unicode like "\u1699"' is also relevant, though it focuses on println() issues with JDK 17/18 in IntelliJ IDEA.
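The remaining parts of the question (removing duplicates, sorting, and saving to a file) are not covered above, so here is a hedged sketch of one possible approach, not a definitive implementation. String.charAt() returns UTF-16 code units, so it can split a character outside the Basic Multilingual Plane into two surrogate halves; iterating over code points avoids that. The class name, method name, and temporary file are hypothetical, and Files.writeString() assumes Java 11 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UniqueSortedCodePoints {

    // Returns the distinct code points of the input, sorted by code point value.
    static String uniqueSorted(String text) {
        return text.codePoints() // whole code points, so surrogate pairs stay together
                .distinct()
                .sorted()
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input with duplicates and one character outside the BMP (U+1F600).
        String result = uniqueSorted("\u00E9\u00E1\u00E9\uD83D\uDE00a\uD83D\uDE00");

        // Hypothetical output location; a real project would choose its own path.
        Path file = Files.createTempFile("unique-chars", ".txt");

        // Passing the charset explicitly avoids any dependence on the platform default.
        Files.writeString(file, result, StandardCharsets.UTF_8);
        String readBack = Files.readString(file, StandardCharsets.UTF_8);

        System.out.println("UTF-8 file round trip preserved text: " + readBack.equals(result)); // true
        Files.delete(file);
    }
}
```

Note that a code point is not always what a reader perceives as one character: a base letter followed by a combining mark is two code points, and this sketch treats them separately. If that matters, normalizing the input first with java.text.Normalizer may help for sequences that have a precomposed form.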
