basZero

Reputation: 4284

Reading ZipEntry containing special characters while using Java SE6

UPDATED WITH SOLUTION, see at bottom

Requirement:
Process a ZIP file in Java SE 6 that contains files with special characters in their names. As the encoding used by the ZIP producer is apparently not UTF-8, special characters come out altered. I would therefore like to convert them back into their proper form.

Issue:
The ZIP contains a file called abcüabc.txt. The entry is processed via java.util.zip.ZipEntry, and when I print out the individual characters of its name I see the following:

ü comes out as u followed by a ¨

Question:
How can I replace that sequence with ü, or maybe with ue?

What I already tried, without success:
name.replaceAll("u\\¨", "ue");
or
name.replaceAll("ü", "ue");

Original Source Code (not working):

InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt
    System.out.println("pos 3: "+name.charAt(3));
    System.out.println("pos 4: "+name.charAt(4));
    System.out.println("is equal to ¨: "+Character.toString(name.charAt(4)).equals("¨"));
}        

Output:

pos 3: u
pos 4:¨
is equal to ¨: false

Notes on my environment:

Zip produced under Mac OS X 10.6.8
Java SE 6: Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)

SOLUTION

Evidently the ZIP producer (in my case Mac OS X) stores special characters in decomposed form, so a ü is decomposed into u followed by a combining diaeresis (U+0308).
When extracting the file names from the ZIP, we want to convert them back from the decomposed to the composed form, so we only have to add a normalization step to the source code from above:

import java.text.Normalizer;
import java.text.Normalizer.Form;

InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt (stored decomposed)
    System.out.println("pos 3: "+name.charAt(3));
    System.out.println("pos 4: "+name.charAt(4));
    System.out.println("contains ü: "+name.contains("ü")); // before normalization
    name = Normalizer.normalize(name, Form.NFC); // compose u + U+0308 into ü
    System.out.println("contains ü: "+name.contains("ü")); // after normalization
}
zipStream.close();

Output:

pos 3: u
pos 4:¨
contains ü: false
contains ü: true
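
The behaviour can also be checked without a Mac-produced archive. What follows is a minimal sketch, not the original code: it writes a one-entry ZIP into memory with a decomposed entry name and reads it back through the same normalization step. The class ZipNameNfcDemo and its two helper methods are hypothetical names chosen for illustration, and the sketch assumes the JDK writes and reads entry names as UTF-8, which is its default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of the original question.
class ZipNameNfcDemo {

    // Builds a one-entry ZIP in memory; the entry name is stored exactly as given.
    static byte[] zipWithEntry(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            ZipOutputStream out = new ZipOutputStream(bytes);
            out.putNextEntry(new ZipEntry(entryName));
            out.write('x'); // one byte of dummy content
            out.closeEntry();
            out.close();
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Reads every entry name and composes it to NFC, as in the solution above.
    static List<String> readNamesAsNfc(byte[] zipBytes) {
        List<String> names = new ArrayList<String>();
        try {
            ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes));
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(Normalizer.normalize(entry.getName(), Normalizer.Form.NFC));
            }
            in.close();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // 'u' followed by U+0308, the decomposed form the Mac producer stores
        String decomposed = "abcu\u0308abc.txt";
        List<String> names = readNamesAsNfc(zipWithEntry(decomposed));
        // Compare against the precomposed U+00FC
        System.out.println(names.get(0).equals("abc\u00fcabc.txt"));
    }
}
```

The unicode escapes are used instead of literal ü characters so that the result does not depend on how the source file itself is encoded or normalized.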

Upvotes: 1

Views: 3999

Answers (2)

lichengwu

Reputation: 4307

You can use Apache Ant to solve the encoding problem.

Import org.apache.tools.zip.*:

ZipFile zipFile = new ZipFile(fileName, "your encoding"); // your encoding, e.g. "UTF-8"
Enumeration emu = zipFile.getEntries();


while(emu.hasMoreElements()){
  ZipEntry entry = (ZipEntry) emu.nextElement();
  // do something
}

The Ant project does not provide online API docs; another copy is available at http://api.dpml.net/ant/1.7.0/

Upvotes: 0

Esailija

Reputation: 140234

That's not a ¨ (U+00A8 DIAERESIS), but the U+0308 COMBINING DIAERESIS.

The character is split this way because Mac OS stores file names in Normalization Form D, which decomposes characters like this.

You can compose it back like so:

String name = zipEntry.getName(); 
name = Normalizer.normalize(name, Form.NFC);
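
If the ASCII fallback ue mentioned in the question is preferred over ü, one possible approach is to decompose first and then replace the base letter plus U+0308. This is only a sketch: the class UmlautToAscii and the method transliterate are hypothetical names, and limiting the rule to a, o and u (as in German umlauts) is an assumption, not something the question specifies.

```java
import java.text.Normalizer;

// Hypothetical helper, shown only to illustrate the "ue" variant.
class UmlautToAscii {

    static String transliterate(String name) {
        // Decompose so that precomposed and decomposed input are handled the same way.
        String nfd = Normalizer.normalize(name, Normalizer.Form.NFD);
        // Replace a/o/u followed by U+0308 COMBINING DIAERESIS with the letter plus 'e'.
        String replaced = nfd.replaceAll("([aouAOU])\\u0308", "$1e");
        // Recompose whatever other accents were split by the NFD step.
        return Normalizer.normalize(replaced, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(transliterate("abcu\u0308abc.txt")); // decomposed input
        System.out.println(transliterate("abc\u00fcabc.txt"));  // precomposed input
    }
}
```

Both calls in main yield abcueabc.txt. The attempts in the question fail for a related reason: their patterns contain U+00A8 or the precomposed U+00FC, while the entry name holds u followed by U+0308.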

More about normalization forms

The difference between the two diaereses is whether or not they modify the preceding base character:

    System.out.println( "u" + (char)0xA8); //u¨
    System.out.println( "u" + (char)0x0308); //ü
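
Because the two variants can look alike in a console, printing code points makes the difference visible. The sketch below uses the hypothetical class DiaeresisCompare with two small helpers to show that NFC composes u + U+0308 into a single U+00FC but leaves u + U+00A8 untouched.

```java
import java.text.Normalizer;

// Hypothetical demo class for comparing the spacing and the combining diaeresis.
class DiaeresisCompare {

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Formats each char as U+XXXX, separated by spaces (sufficient for BMP characters).
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String spacing = "u" + (char) 0xA8;     // u followed by DIAERESIS
        String combining = "u" + (char) 0x0308; // u followed by COMBINING DIAERESIS

        System.out.println(codePoints(nfc(spacing)));   // U+0075 U+00A8
        System.out.println(codePoints(nfc(combining))); // U+00FC
    }
}
```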

Upvotes: 3
