Reputation: 5736
I am trying to write a method that, given a single-byte character value in a specific code page, returns the Unicode code point of the same visual character.
For example, given char c = 128, which is '€' in the Windows-1252 code page, calling
int result = asUnicode(c, "windows-1252");
should give 8364.
For the same char c = 128, which is 'Ђ' in the Windows-1251 code page, calling
int result = asUnicode(c, "windows-1251");
should give 1026.
How can this be done in Java?
Upvotes: 1
Views: 176
Reputation: 3531
c shouldn't really be a char, but a byte[] holding bytes in the corresponding encoding, e.g. windows-1252. For this simple case, we can just wrap the char into a byte[] ourselves.
You need to decode those bytes to Java's char type, which holds UTF-16 code units: a BMP code point fits in a single char, while a supplementary code point needs a surrogate pair. Then you return the corresponding code point.
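import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public static int asUnicode(char c, String charset) throws Exception {
    // Treat c as a single byte in the given code page and decode it to UTF-16.
    CharBuffer result = Charset.forName(charset).decode(ByteBuffer.wrap(new byte[] { (byte) c }));
    int unicode;
    char first = result.get();
    if (Character.isHighSurrogate(first) && result.hasRemaining()) {
        // Supplementary code point: combine the surrogate pair.
        unicode = Character.toCodePoint(first, result.get());
    } else {
        unicode = first;
    }
    return unicode;
}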
The following
public static void main(String[] args) throws Exception {
    char c = 128;
    System.out.println(asUnicode(c, "windows-1252"));
    System.out.println(asUnicode(c, "windows-1251"));
}
prints
8364
1026
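As a rough alternative sketch, the same decoding can be written more compactly by going through String, and a stricter variant can reject bytes that have no mapping in the code page. Charset.decode (like new String(byte[], Charset)) silently substitutes the charset's replacement string for unmappable input, so the strict variant configures a CharsetDecoder to report errors instead. The class name CodePageDemo and the method names toCodePoint and toCodePointStrict are made up for illustration, and taking an int byte value instead of a char is an assumption about what the caller has at hand.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CodePageDemo {

    // Hypothetical helper: decodes one byte (0-255) in the named code page
    // and returns the Unicode code point of the resulting character.
    public static int toCodePoint(int byteValue, String charsetName) {
        byte[] encoded = { (byte) byteValue };
        // new String(byte[], Charset) decodes to UTF-16; codePointAt(0)
        // combines a surrogate pair into one code point if there is one.
        return new String(encoded, Charset.forName(charsetName)).codePointAt(0);
    }

    // Hypothetical strict variant: throws IllegalArgumentException when the
    // byte is malformed or unmappable, instead of returning a replacement.
    public static int toCodePointStrict(int byteValue, String charsetName) {
        try {
            CharBuffer decoded = Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(new byte[] { (byte) byteValue }));
            return Character.codePointAt(decoded, 0);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "Byte " + byteValue + " cannot be decoded as " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toCodePoint(128, "windows-1252"));
        System.out.println(toCodePointStrict(128, "windows-1251"));
    }
}
```

Both calls in main should agree with the values printed above for the same inputs. Which variant fits depends on whether a silently substituted replacement character is acceptable for bytes the code page leaves undefined.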
Upvotes: 2