Reading ZipEntry containing special characters while using Java SE6

Question

UPDATED WITH SOLUTION, see at bottom

Requirement:
Process a ZIP file in Java SE 6 that contains files with special characters in their names. The ZIP producer apparently does not encode the names the way I expected, so special characters come out altered. I would therefore like to convert them back into the proper characters.

Issue:
The ZIP contains a file called abcüabc.txt. The entry is read via java.util.zip.ZipEntry, and when I print the name character by character I see the following:

ü gets encoded as u followed by a ¨

Question:
How can I replace that u¨ with ü, or maybe with ue?

What I already tried without success:
name.replaceAll("u\¨", "ue");
or
name.replaceAll("ü", "ue");

Original Source Code (not working):

InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt
    System.out.println("pos 3: "+name.charAt(3));
    System.out.println("pos 4: "+name.charAt(4));
    System.out.println("is equal to ¨: "+Character.toString(name.charAt(4)).equals("¨"));
}

Output:

pos 3: u
pos 4:¨
is equal to ¨: false

Notes on my environment:

Zip produced under Mac OS X 10.6.8
Java SE 6: Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)

SOLUTION

Apparently, the ZIP producer (Mac OS X in my case) stores special characters in decomposed form, so ü is stored as u followed by a combining ¨.
When extracting the file names from the ZIP, we want to convert them back from the decomposed to the composed form, so we only have to add a normalization step to the source code above:

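import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt
    System.out.println("pos 3: "+name.charAt(3));
    System.out.println("pos 4: "+name.charAt(4));
    System.out.println("contains ü: "+name.contains("ü"));
    // compose "u" + combining diaeresis back into a single "ü"
    name = Normalizer.normalize(name, Form.NFC);
    System.out.println("contains ü: "+name.contains("ü"));
}
zipStream.close(); // also closes the underlying FileInputStream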

Output:

pos 3: u
pos 4:¨
contains ü: false
contains ü: true
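
For reuse, the loop above could be wrapped in a small helper. The following is only a minimal sketch and not part of the original code: the class name NormalizedZipNames and the method readNames are made up for illustration, and the demo builds a ZIP in memory instead of reading a file from disk. It sticks to Java SE 6 syntax, so the stream is closed in a finally block rather than with try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class NormalizedZipNames {

    // Hypothetical helper: returns every entry name in composed (NFC) form.
    static List<String> readNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<String>();
        ZipInputStream zipStream = new ZipInputStream(in);
        try {
            ZipEntry entry;
            while ((entry = zipStream.getNextEntry()) != null) {
                names.add(Normalizer.normalize(entry.getName(), Normalizer.Form.NFC));
            }
        } finally {
            zipStream.close();
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a ZIP in memory whose single entry name uses the decomposed form.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ZipOutputStream out = new ZipOutputStream(buffer);
        out.putNextEntry(new ZipEntry("abcu\u0308abc.txt"));
        out.closeEntry();
        out.close();

        List<String> names = readNames(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(names.get(0).contains("\u00fc")); // prints true
    }
}
```

Normalizing once at the point where the name is read means the rest of the program only ever sees composed names.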

Esailija · Accepted Answer

That's not a ¨ (U+00A8 DIAERESIS), but the U+0308 COMBINING DIAERESIS.

The character is split this way because Mac OS X stores file names in Normalization Form D (NFD), which decomposes characters like this.

You can compose it back like so:

String name = zipEntry.getName(); 
name = Normalizer.normalize(name, Form.NFC);
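
The question also asked about replacing ü with ue. Once the name is composed, a plain string replacement should work, because the name then contains the same precomposed character as the search string. Here is a minimal sketch under that assumption; the class UmlautTransliterator and the method toAscii are hypothetical names, and the mapping covers only the German umlauts and ß.

```java
import java.text.Normalizer;

class UmlautTransliterator {

    // Hypothetical helper: maps German umlauts to their two-letter ASCII spellings.
    static String toAscii(String name) {
        // Compose first so that "u" + U+0308 and the precomposed U+00FC are treated alike.
        String composed = Normalizer.normalize(name, Normalizer.Form.NFC);
        return composed
                .replace("\u00e4", "ae")
                .replace("\u00f6", "oe")
                .replace("\u00fc", "ue")
                .replace("\u00c4", "Ae")
                .replace("\u00d6", "Oe")
                .replace("\u00dc", "Ue")
                .replace("\u00df", "ss");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("abcu\u0308abc.txt")); // prints abcueabc.txt
    }
}
```

The umlauts are written as Unicode escapes so that the result does not depend on the encoding of the source file.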

More about normalization forms

The difference between the two diaereses is whether they modify the preceding base character. U+00A8 is a standalone spacing character, while U+0308 combines with the character before it:

    System.out.println( "u" + (char)0xA8); //u¨
    System.out.println( "u" + (char)0x0308); //ü
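
Because the two characters can look identical on screen, printing the code points is a more reliable way to tell them apart. The following is a small diagnostic sketch; the class CodePointDump and its dump method are illustrative names, not an existing API.

```java
import java.text.Normalizer;

class CodePointDump {

    // Hypothetical helper: formats each code point as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308";
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(dump(decomposed)); // U+0075 U+0308
        System.out.println(dump(composed));   // U+00FC
        System.out.println(dump("u\u00a8"));  // U+0075 U+00A8
    }
}
```

Running dump on zipEntry.getName() would show directly which of the two diaereses the ZIP entry contains.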
