Reputation: 1200
I am working on a project that deals with foreign languages. I have this String in Java:
String string = "áçñéüöëéóíóíóíóíííããéíáéíáççãÓłńńāņšøøøøééèèÜüÜüééíéáéáříříççááññïïššäääééèèááéáéáéáéáéáéáèèèèííéèéèáééÇÇééééííüüüüííøøáááá¿¿ííóé̌Íá̌íáææööíÁíÁíííłççññá璇üşİüşİöğöğşşııããßßôèôèêééççáÁáÁééééééÇóóéíêööééííððññáñáñÓúÓúíłńłńååéééëëááéí¿¿ééÖÖáéáéöğÖüöğÖüçŞçŞııçııçııİİşİşíáíáéüüÉÉéééøññïíéé";
and I have saved my Java source file in UTF-8 encoding.
I want to remove duplicated characters, then sort the characters by their Unicode code points, and finally print out the resulting string and save it to a text file (in UTF-8 or another Unicode encoding).
I don't know if it is because of the terminal: I am working in Eclipse (Windows) and I see '?' (question mark) when printing some of the characters. What is the correct way to print the string?
I am also not sure how to SAFELY remove duplicated characters and sort the characters. For example, if I use String.charAt() and HashSet<Character>, is it safe to do so in my case? Will I get half a character for some multi-byte character? What is a safe way to compare these characters?
Knowing that the project may deal with a very large variety of different languages, what is a safe way to save the string to a text file?
Update: To reproduce the question mark problem:
String str = "¿æŁéİüłïąņąø";
System.out.println(str);
It prints out this on my Eclipse console:
¿æ?é?ü?ï???ø
Note: I am already using GNU FreeMono for the console font, which has very good coverage of foreign characters.
Upvotes: 2
Views: 2087
Reputation: 17363
When calling System.out.println(str), the charset used by the underlying PrintStream (i.e. System.out) is your default charset, and if that is not UTF-8 then you might have problems when rendering in the Eclipse console. From the javadoc for PrintStream:
All characters printed by a PrintStream are converted into bytes using the given encoding or charset, or platform's default character encoding if not specified.
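To see which charset is actually in effect, and what it does to the sample string, a small check along these lines may help. This is only a sketch: the class name CharsetCheck and the helper roundTrip are made up for illustration, and ISO-8859-1 is used as a stand-in for a typical non-UTF-8 Windows default (the asker's actual default is not stated). The sample is written with Unicode escapes so the result does not depend on how the source file is read.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    // Sample from the question, written as escapes so that the
    // source file encoding cannot affect it.
    static final String SAMPLE =
            "\u00BF\u00E6\u0141\u00E9\u0130\u00FC\u0142\u00EF\u0105\u0146\u0105\u00F8";

    // Encodes the text to bytes in the given charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        System.out.println("Default charset: " + def);
        // Only ASCII is printed here, so the console encoding cannot distort it.
        System.out.println("Survives default charset: "
                + roundTrip(SAMPLE, def).equals(SAMPLE));
        System.out.println("Survives ISO-8859-1: "
                + roundTrip(SAMPLE, StandardCharsets.ISO_8859_1).equals(SAMPLE));
        System.out.println("Survives UTF-8: "
                + roundTrip(SAMPLE, StandardCharsets.UTF_8).equals(SAMPLE));
    }
}
```

With ISO-8859-1 the round trip turns the sample into the same pattern of question marks shown in the question, which is consistent with an encoding problem rather than a font problem.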
So your console output is probably not working because your "platform's default character encoding" is not UTF-8. There are two simple coding approaches to resolve that:
- Redirect System.out so that it uses UTF-8.
- Use your own PrintStream that uses UTF-8 instead of using System.out.
Here's code which reproduces your problem, and resolves it:
package pkg;

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Main {

    public static void main(String[] args) throws IOException {
        String str = "¿æŁéİüłïąņąø"; // Sample data from the question.
        System.out.println("1: " + str); // Fails if default charset is not UTF-8.

        // Redirect System.out to use a PrintStream using UTF-8 charset.
        // (The PrintStream constructor taking a Charset requires Java 10 or later.)
        FileOutputStream fos2 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps2 = new PrintStream(fos2, true, StandardCharsets.UTF_8);
        System.setOut(ps2);
        System.out.println("2: " + str); // Works.

        // Use your own PrintStream with UTF-8 charset instead of using System.out.
        FileOutputStream fos3 = new FileOutputStream(FileDescriptor.out);
        PrintStream ps3 = new PrintStream(fos3, true, StandardCharsets.UTF_8);
        ps3.print("3: " + str); // Works.
        ps3.close(); // Also closes standard output, so do this last.
    }
}
[Screenshot: Eclipse console output from running that code, demonstrating that both solutions described above work.]
Notes:
- If your platform's default charset were UTF-8, System.out.println(str) would automatically encode using UTF-8, but relying on that would mean your code is not portable.
- Related: println() issues with JDK 17/18 on IntelliJ IDEA.
Upvotes: 4
Reputation: 552
Characters in running Java programs are intrinsically Unicode (they are in fact stored as UTF-16, which you can ignore until you're interested in code points U+10000 or greater, which you're probably not at this point - but if you are, look at the "code point" operations on String and Character).
A String is thus automatically a string of Unicode characters.
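For the deduplicate-and-sort part of the question, working on code points rather than on char values avoids splitting a character above U+FFFF into two surrogate halves, which is what String.charAt() with HashSet<Character> would do for such characters. A minimal sketch, assuming Java 8 or later; the class name UniqueSortedCodePoints and the method uniqueSorted are made up for illustration:

```java
public class UniqueSortedCodePoints {
    // Returns the distinct code points of the input, in ascending
    // code point order, as a new String.
    static String uniqueSorted(String input) {
        return input.codePoints()   // IntStream of full code points, not chars
                .distinct()
                .sorted()
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // Shortened sample using escapes; U+1F600 is outside the BMP.
        String input = "\u00E9a\u00E9b\uD83D\uDE00a\uD83D\uDE00";
        String result = uniqueSorted(input);
        // Print code point values in ASCII so the console encoding is irrelevant.
        result.codePoints()
              .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Note that this treats every code point separately, so a combining mark (such as the caron in the question's sample) becomes its own entry, and a precomposed letter and its decomposed form count as different. If that matters, normalizing first with java.text.Normalizer may be worth considering.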
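Java source code is commonly saved as UTF-8, but how javac reads it is configurable: the -encoding option sets it explicitly, and without that option javac uses the default charset (UTF-8 from JDK 18 onward, platform-dependent before that). I'm a "UTF-8 only" person myself.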
So what this boils down to is that you don't have to do anything special in a Java program to "use Unicode" - it just is.
You may need to pay attention to cases where you read and write Strings to some external medium, like a disk file or a network connection. There is a conversion to a byte stream, which uses the default charset unless you say otherwise: UTF-8 from JDK 18 onward, and the platform's encoding (often not UTF-8 on Windows) on older JDKs. You can explicitly specify the byte encoding in most contexts.
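As an illustration of specifying the encoding explicitly when saving to a file, here is a minimal sketch using java.nio.file.Files. The class name Utf8File and its helper methods are made up for illustration, and a temporary file stands in for whatever path the project actually uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8File {
    // Writes the text as UTF-8 bytes, regardless of the default charset.
    static void save(Path file, String text) {
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file back, decoding its bytes as UTF-8.
    static String load(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Saves the text to a temporary file, reads it back, and deletes the file.
    static String roundTripViaTempFile(String text) {
        try {
            Path file = Files.createTempFile("chars", ".txt");
            try {
                save(file, text);
                return load(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00BF\u00E6\u0141\u00E9\u0130\u00FC\u0142\u00EF\u0105\u0146\u0105\u00F8";
        System.out.println("Round trip intact: "
                + roundTripViaTempFile(text).equals(text));
    }
}
```

UTF-8 can represent every Unicode code point, so the same approach should hold however many languages the project ends up handling, as long as whatever reads the file also decodes it as UTF-8.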
Your remaining problem seems to be related to display on Windows. That appears to be a font issue; you need a font containing the characters. Or, since it's Windows, it may be a matter of selecting the right "code page".
Upvotes: 2