Vreny

Reputation: 121

How to create my own unique Charset in Java?

I would like to create my own Charset in Java and then use it for encoding. It needs to include some particular symbols, all of the digits, and four languages (Traditional Chinese, US English, Polish and Russian).

I looked through the Charset class but didn't really find a solution.

Upvotes: 4

Views: 1790

Answers (3)

rzwitserloot

Reputation: 103388

See this SO answer where I wrote an entire implementation including instructions on how to register it, for IBM-924.

As others have said, your specific need here (namely, having four different groups of characters available in a single encoding) suggests that UTF-8 is what you want. But in case we have misunderstood, or in case this post is found by someone with a legitimate need to write their own charset implementation, the linked answer shows you exactly what to do.
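To illustrate the UTF-8 point, here is a minimal sketch that checks whether one string mixing all four languages plus digits survives a UTF-8 round trip. The class name, method names, and sample text are made up for illustration; the non-ASCII characters are written as \u escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Coverage {
    // Sample text: English, Traditional Chinese, Polish, Russian, digits.
    static final String SAMPLE =
            "Hello \u7E41\u9AD4 "                            // two Traditional Chinese characters
            + "\u017C\u00F3\u0142\u0107 "                    // Polish letters z-dot, o-acute, l-stroke, c-acute
            + "\u041F\u0440\u0438\u0432\u0435\u0442 "        // Russian word "Privet"
            + "0123456789";

    // True if every character in the text can be represented in UTF-8.
    static boolean utf8CanEncode(String text) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(text);
    }

    // Encodes to UTF-8 bytes and decodes back again.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf8CanEncode(SAMPLE));
        System.out.println(roundTrip(SAMPLE).equals(SAMPLE));
    }
}
```

If both checks come back true for your real data, a custom charset is probably not needed.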

NB: If you think a custom charset 'adds security', oh dear. No, no, it really, really does not. You're about 100 years behind the times on crypto. This is called a simple substitution cipher, and given that text tends to have predictable properties (such as: it's written in English, therefore it is somewhat likely to start with 'Hello', and the most common letter is likely 'e'), you can trivially use these to undo whatever mapping you care to apply. This is the most basic of basic crypto stuff, and folks were undoing such 'crypto' by hand 500 years ago. Assuming you apply an actual cryptography algorithm using a proper protocol (big if!), then also applying a custom charset on top of it is a bit like sticking a plaster over the hinges on Fort Knox: two protective measures so incredibly far removed from each other that it'd be played off as a joke in a movie.

Upvotes: 1

Stephen C

Reputation: 719307

Basil's answer explains that you don't need to define a custom Charset in order to support some non-standard symbols.

But if you really do need to do it, you will have to write a custom class that extends Charset. There are 3 abstract methods that you have to implement:

  • boolean contains(Charset cs) - Tells whether or not this charset contains the given charset.

  • CharsetDecoder newDecoder() - Constructs a new decoder for this charset.

  • CharsetEncoder newEncoder() - Constructs a new encoder for this charset.

The other methods in the Charset API most likely don't need to be overridden.

The decoder and encoder need to be able to convert between a ByteBuffer containing text in your charset's encoding and a CharBuffer containing Unicode text (as UTF-16 chars). CharsetDecoder and CharsetEncoder are themselves abstract classes. Each requires you to implement one method, decodeLoop or encodeLoop respectively, and those methods have complicated requirements.

I am not aware of any specific documentation or tutorials on how to implement a custom Charset and its CharsetDecoder and CharsetEncoder classes. But you should be able to find example code in the OpenJDK Java SE codebase. (They will be internal classes ...)
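For anyone who wants to see the shape of it, here is a minimal sketch of a single-byte charset, not a production implementation. Everything specific in it is an assumption made for illustration: the name "X-DEMO", the EXTRA table, and the rule that bytes 0x00-0x7F are ASCII while bytes from 0x80 upward index into EXTRA. It ignores surrogate pairs and the array-backed fast paths that the JDK's own coders use.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical charset: one byte per character, ASCII plus a small extra table.
final class DemoCharset extends Charset {
    // Illustrative table: l-stroke, Cyrillic Zhe, a Chinese character, a private-use symbol.
    static final char[] EXTRA = {'\u0142', '\u0416', '\u4E2D', '\uE000'};

    DemoCharset() {
        super("X-DEMO", null); // canonical name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        // Every ASCII character is representable here, so ASCII is "contained".
        return cs.equals(this) || cs.equals(StandardCharsets.US_ASCII);
    }

    @Override
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    int b = in.get(in.position()) & 0xFF; // peek without consuming
                    char c;
                    if (b < 0x80) c = (char) b;
                    else if (b - 0x80 < EXTRA.length) c = EXTRA[b - 0x80];
                    else return CoderResult.malformedForLength(1);
                    in.get();   // consume the byte only once it is known to be valid
                    out.put(c);
                }
                return CoderResult.UNDERFLOW; // all input consumed
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    char c = in.get(in.position()); // peek without consuming
                    int b = c < 0x80 ? c : byteForExtra(c);
                    if (b < 0) return CoderResult.unmappableForLength(1);
                    in.get();
                    out.put((byte) b);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    // Returns the byte value (0x80 and up) for a char in EXTRA, or -1 if it is not mapped.
    static int byteForExtra(char c) {
        for (int i = 0; i < EXTRA.length; i++) {
            if (EXTRA[i] == c) return 0x80 + i;
        }
        return -1;
    }
}
```

With this in place, new DemoCharset().encode(someString) and .decode(someByteBuffer) work directly on an instance. To make the charset available through Charset.forName, it would additionally have to be registered via a java.nio.charset.spi.CharsetProvider listed under META-INF/services.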


"I tried to browse Charset class but didn't really find a solution."

Well, the "solution" is that you will need to study existing examples ... or conclude that you don't need to solve this problem at all. See above.

Upvotes: 1

Basil Bourque

Reputation: 339562

Private Use Areas within Unicode

You’ve not really explained what goal you are trying to achieve, but likely there is no need to invent either:

  • a character set (a collection of numbers each assigned to a particular character)
  • a character encoding (a way to represent instances of those numbers as bits and bytes).

Unicode defines over 144,000 characters, each assigned a number (a "code point") from a range of zero to just over a million (U+0000 to U+10FFFF). That leaves large gaps of numbers unassigned. Some of those empty sub-ranges are reserved for future use. But, of interest to you, some of those sub-ranges are set aside for “private use”, never ever to be assigned to a character by the Unicode Consortium: U+E000 to U+F8FF, plus planes 15 and 16. See Wikipedia.

👉 You are free to assign any meaning you wish to any number within those “private use areas”. So that works as your character set.

👉 As for your character encoding, using UTF-8 is almost always best. This is true for several reasons, as discussed here.

Java supports all of Unicode, so no extra programming is needed to support your characters. Everything works the same whether encountering characters from inside or outside the private use areas.
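As a rough sketch of that approach, the example below picks a private-use code point for a custom symbol and sends it through UTF-8 like any other character. The choice of U+E000 and the names PrivateUseDemo and MY_SYMBOL are assumptions made for illustration; the meaning of the code point is whatever you and your readers agree on.

```java
import java.nio.charset.StandardCharsets;

final class PrivateUseDemo {
    // Hypothetical assignment: U+E000 stands for one of your own symbols.
    static final int MY_SYMBOL = 0xE000;

    // True if Unicode classifies the code point as private use (general category Co).
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // Builds a one-code-point string; also works for supplementary code points.
    static String asString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Encodes to UTF-8 bytes and decodes back again.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "ID " + asString(MY_SYMBOL) + " 42";
        System.out.println(isPrivateUse(MY_SYMBOL));
        System.out.println(roundTripUtf8(text).equals(text));
    }
}
```

Displaying the symbol is a separate matter: a font that maps a glyph to the chosen code point is needed, otherwise most software shows a placeholder box.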

If you want to involve other people in your endeavor, or want to share documents, then you should be aware that there is an unofficial registry of characters assigned to Private Use numbers. This unofficial registry is a volunteer effort, made outside of the Unicode Consortium. This registry is for characters that would never be accepted for inclusion in Unicode. This includes imaginary languages such as Klingon from Star Trek. When selecting code point numbers for your characters, you may want to avoid these unofficially registered code points.

Upvotes: 0
