Guessing the text encoding from a UTF-8 BOM file in Java 6

Question

I'm getting txt files in both Hebrew and Arabic, encoded as UTF-8 with a BOM. I need to convert them to Windows-1255 or Windows-1256, depending on the content.

How can I determine, at runtime, the correct encoding to use?

No luck with Mozilla UniversalDetector, nor with any other solution that I've found. Any ideas? (I need to do it with Java 6. Don't ask why...)

Joop Eggen · Accepted Answer

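As of Java 1.7 the Character class knows of Unicode scripts like Arabic and Hebrew. (The snippet below also uses streams and lambdas, so it needs Java 8.) A positive sum means more Arabic code points, a negative sum more Hebrew ones: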

int freqs = s.codePoints().map(cp ->
        Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC ? 1
        : Character.UnicodeScript.of(cp) == Character.UnicodeScript.HEBREW ? -1
        : 0).sum();
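
For Java 7, where `Character.UnicodeScript` exists but streams do not, the same tally can be written as a plain loop. This is only a sketch; the class name `ScriptTally` and the method name `arabicMinusHebrew` are made up for illustration.

```java
final class ScriptTally {

    /**
     * Returns (number of Arabic code points) - (number of Hebrew code points).
     * Positive: mostly Arabic. Negative: mostly Hebrew. Zero: tie or neither.
     */
    static int arabicMinusHebrew(String s) {
        int balance = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.ARABIC) {
                balance++;
            } else if (script == Character.UnicodeScript.HEBREW) {
                balance--;
            }
        }
        return balance;
    }
}
```

Digits, punctuation and Latin letters count as neither script, so they do not move the balance.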

For Java 1.6 the directionality might be sufficient, as Character distinguishes DIRECTIONALITY_RIGHT_TO_LEFT (which Hebrew letters have) from DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:

    String s = "אבגדהאבגדהصضطظع"; // First Hebrew, then Arabic.
    int i0 = 0;
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        i += Character.charCount(codePoint);
        boolean rtl = Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT;
        boolean rtl2 = Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        System.out.printf("[%d - %d] '%s': RTL %s %s%n",
                i0, i, s.substring(i0,  i), rtl, rtl2);
        i0 = i;
    }

[0 - 1] 'א': RTL true false
[1 - 2] 'ב': RTL true false
[2 - 3] 'ג': RTL true false
[3 - 4] 'ד': RTL true false
[4 - 5] 'ה': RTL true false
[5 - 6] 'א': RTL true false
[6 - 7] 'ב': RTL true false
[7 - 8] 'ג': RTL true false
[8 - 9] 'ד': RTL true false
[9 - 10] 'ה': RTL true false
[10 - 11] 'ص': RTL false true
[11 - 12] 'ض': RTL false true
[12 - 13] 'ط': RTL false true
[13 - 14] 'ظ': RTL false true
[14 - 15] 'ع': RTL false true

So

// Counts chars with Arabic right-to-left directionality; stops once past 1000.
int arabic(String s) {
    int n = 0;
    for (char ch : s.toCharArray()) {
        if (Character.getDirectionality(ch)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
            ++n;
            if (n > 1000) {
                break;
            }
        }
    }
    return n;
}

// Counts chars with plain right-to-left directionality (e.g. Hebrew); stops once past 1000.
int hebrew(String s) {
    int n = 0;
    for (char ch : s.toCharArray()) {
        if (Character.getDirectionality(ch)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
            ++n;
            if (n > 1000) {
                break;
            }
        }
    }
    return n;
}

String guessEncoding(String s) {
    if (arabic(s) > 0) {
        return "Windows-1256";
    } else if (hebrew(s) > 0) {
        return "Windows-1255";
    } else {
        // Neither script found; placeholder name, not a real charset.
        return "Klingon-1257";
    }
}
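
To tie this back to the question, here is a minimal sketch of the whole conversion using only Java 6 APIs: decode the bytes as UTF-8, drop the leading BOM (Java's UTF-8 decoder keeps it as U+FEFF), guess the target charset from directionality, and re-encode. The class and method names are hypothetical. Unlike the snippet above, it picks whichever script has more letters rather than testing Arabic first; that is my own assumption. It also assumes the JRE ships the windows-1255 and windows-1256 charsets, which standard installs normally do.

```java
import java.nio.charset.Charset;

final class BomTextConverter {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    /** Decodes UTF-8 bytes and removes a leading BOM (U+FEFF) if present. */
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, UTF8);
        if (text.length() > 0 && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text;
    }

    /** Returns "windows-1256", "windows-1255", or null if no RTL letters occur. */
    static String guessCharset(String text) {
        int arabic = 0;
        int hebrew = 0;
        for (int i = 0; i < text.length(); i++) {
            byte dir = Character.getDirectionality(text.charAt(i));
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                arabic++;
            } else if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
                hebrew++;
            }
        }
        if (arabic == 0 && hebrew == 0) {
            return null;
        }
        // Assumption: the majority script wins; ties go to Arabic.
        return arabic >= hebrew ? "windows-1256" : "windows-1255";
    }

    /** Converts UTF-8 (with or without BOM) to the guessed Windows code page. */
    static byte[] convert(byte[] utf8Bytes) {
        String text = decodeUtf8(utf8Bytes);
        String name = guessCharset(text);
        if (name == null) {
            throw new IllegalArgumentException("No Hebrew or Arabic letters found");
        }
        return text.getBytes(Charset.forName(name));
    }

    public static void main(String[] args) {
        // In-memory sample: BOM followed by four Hebrew letters.
        byte[] input = "\uFEFF\u05E9\u05DC\u05D5\u05DD".getBytes(UTF8);
        byte[] output = convert(input);
        System.out.println(guessCharset(decodeUtf8(input)) + ", " + output.length + " bytes");
    }
}
```

Note that `String.getBytes(Charset)` silently substitutes a replacement byte for characters the target code page cannot represent, so a file mixing both scripts will lose the minority script's letters. Reading and writing the actual files is left to the caller.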
