Srii

Reputation: 587

Java: Advice on Charset Conversion

I have been working on a scenario that does the following:

  1. Get input data in Unicode format; [UTF-8]
  2. Convert to ISO-8559;
  3. Detect & replace unsupported characters for encoding; [Based on user-defined key-value pairs]

My first question: I have been trying to find in-depth information on ISO-8559, with no luck so far. Does anybody know more about it? How does it differ from ISO-8859? Any details would be very helpful.

Secondly, setting the ISO-8559 requirement aside, I went ahead and wrote my program to convert the incoming data to ISO-8859 in Java. While I can achieve what is needed using character-by-character replacement, it is obviously time-consuming when the data is large (in the MB range).

I am sure there must be a better way to do this. Can someone advise me, please?

Upvotes: 1

Views: 323

Answers (1)

Joop Eggen

Reputation: 109532

I assume you want to convert UTF-8 to ISO-8859-1, that is, Western Latin-1. There are many charset tables on the net.

  1. In general, for web browsers and Windows it would be better to convert to Windows-1252, which is an extension redefining the range 0x80 - 0x9F, among other things with the special quotes seen in MS Word. Browsers are de facto capable of interpreting these codes in a page declared as ISO-8859-1, even on a Mac.

  2. Java's standard conversion, such as new OutputStreamWriter(new FileOutputStream("..."), "Windows-1252"), already does a lot. You can either write a kind of filter that runs before the conversion, or look afterwards for the ? that the conversion introduces for untranslatable special characters. You could translate Latin letters with accents that are not in Windows-1252 to plain ASCII letters:

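        // requires: import java.text.Normalizer;
        String s = ...
        // NFD splits an accented letter into base letter + combining marks
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        // drop the combining marks, leaving only the base letters
        return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");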
    
  3. For other scripts like Hindi or Cyrillic the keyword to search for is transliteration.
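
As a rough sketch of the "filter" idea from point 2, the following does a single pass over the input and only touches characters the target charset cannot encode. It is not a definitive implementation: the class name CharsetFilter, the method filter, and the replacements map (standing in for your user-defined key-value pairs, keyed by code point) are all hypothetical names chosen for illustration. It tries the user-defined replacement first, then the accent stripping shown above, and falls back to ? otherwise.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library. Not thread-safe,
// because CharsetEncoder instances must not be shared between threads.
final class CharsetFilter {
    // Compiled once so the regex is not rebuilt for every character.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    private final CharsetEncoder encoder;
    // Assumed shape of the user-defined pairs: code point -> replacement text.
    private final Map<Integer, String> replacements;

    CharsetFilter(Charset target, Map<Integer, String> replacements) {
        this.encoder = target.newEncoder();
        this.replacements = replacements;
    }

    String filter(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            String piece = new String(Character.toChars(cp));

            if (encoder.canEncode(piece)) {
                // Supported by the target charset: copy unchanged.
                out.append(piece);
            } else if (replacements.containsKey(cp)) {
                // Unsupported, but the user supplied a replacement.
                out.append(replacements.get(cp));
            } else {
                // Last resort: strip accents, else emit a placeholder.
                String stripped = MARKS
                        .matcher(Normalizer.normalize(piece, Normalizer.Form.NFD))
                        .replaceAll("");
                boolean usable = !stripped.isEmpty() && encoder.canEncode(stripped);
                out.append(usable ? stripped : "?");
            }
        }
        return out.toString();
    }
}
```

With StandardCharsets.ISO_8859_1 as the target and a map entry from 0x20AC (the euro sign) to "EUR", the euro sign would be replaced by EUR while letters already in Latin-1 pass through untouched. The filtered string can then be written through an OutputStreamWriter for the same charset. For input in the MB range you would probably apply filter to chunks read from a BufferedReader rather than to one huge string.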

Upvotes: 2
