Reputation: 4284
UPDATED WITH SOLUTION, see at bottom
Requirement:
Process a ZIP file in Java SE 6 that contains files with special characters in their names. Because the ZIP producer does not use UTF-8, the special characters come out encoded differently, so I would like to convert them back to their proper form.
Issue:
The ZIP contains a file called abcüabc.txt.
The entry gets processed via java.util.zip.ZipEntry, and when printing out single characters I see these characters (bytes): ü gets encoded as u followed by a ¨.
Question:
So I would like to know how I can replace that u¨ with ü, or maybe with ue.
What I already tried, without success:
name.replaceAll("u\\¨", "ue");
or
name.replaceAll("ü", "ue");
Original Source Code (not working):
InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt
    System.out.println("pos 3: " + name.charAt(3));
    System.out.println("pos 4: " + name.charAt(4));
    System.out.println("is equal to ¨: " + Character.toString(name.charAt(4)).equals("¨"));
}
Output:
pos 3: u
pos 4: ¨
is equal to ¨: false
Notes on my environment:
Zip produced under Mac OS X 10.6.8
Java SE 6: Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)
SOLUTION
Obviously, the ZIP producer (in my case Mac OS X) converts special characters into a decomposed format, so a ü gets decomposed into u¨.
While extracting the file names from the ZIP, we would like to convert back from the decomposed to the composed format, so we only have to insert a normalization step into the source code from above:
import java.text.Normalizer;
import java.text.Normalizer.Form;

InputStream is = new FileInputStream(new File("/Users/me/Desktop/test.zip"));
ZipInputStream zipStream = new ZipInputStream(is);
ZipEntry zipEntry = null;
while ((zipEntry = zipStream.getNextEntry()) != null) {
    String name = zipEntry.getName(); // reading abcüabc.txt
    System.out.println("pos 3: " + name.charAt(3));
    System.out.println("pos 4: " + name.charAt(4));
    System.out.println("contains ü: " + name.contains("ü"));
    // Recompose u + combining diaeresis into a single ü
    name = Normalizer.normalize(name, Form.NFC);
    System.out.println("contains ü: " + name.contains("ü"));
}
Output:
pos 3: u
pos 4: ¨
contains ü: false
contains ü: true
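For anyone who wants to try the normalization step without a ZIP file at hand, here is a minimal sketch. The class name ZipNameNormalizer, the helper toComposed, and the hard-coded entry name are made up for illustration; the string simply mimics what zipEntry.getName() returned in my case.

```java
import java.text.Normalizer;

final class ZipNameNormalizer {

    // Recompose decomposed sequences (for example 'u' followed by U+0308)
    // into their single-character form (U+00FC).
    static String toComposed(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Hypothetical entry name in decomposed form: "abc" + 'u' + U+0308 + "abc.txt"
        String decomposed = "abcu\u0308abc.txt";
        String composed = toComposed(decomposed);

        System.out.println(decomposed.length());          // 12
        System.out.println(composed.length());            // 11
        System.out.println(composed.contains("\u00FC"));  // true
    }
}
```

The composed name is one char shorter because the two code units u and U+0308 collapse into the single U+00FC.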
Upvotes: 1
Views: 3999
Reputation: 4307
You can use Apache Ant to solve the encoding problem.
import org.apache.tools.zip.*;
ZipFile zipFile = new ZipFile(fileName, "your encoding"); // your encoding, e.g. "UTF-8"
Enumeration emu = zipFile.getEntries();
while (emu.hasMoreElements()) {
    ZipEntry entry = (ZipEntry) emu.nextElement();
    // do something
}
The Ant project does not provide online API documentation; here is another copy: http://api.dpml.net/ant/1.7.0/
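If you can run on Java 7 or later (so not the Java SE 6 from the question), the JDK's own java.util.zip classes also accept a Charset, which is the same idea without the Ant dependency. The following is only a sketch: the class ZipCharsetDemo and its two helpers are invented for this example, and ISO-8859-1 is just a stand-in for whatever encoding your ZIP producer really uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCharsetDemo {

    // Build a one-entry ZIP in memory, encoding the entry name with the given charset.
    static byte[] writeZip(String entryName, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ZipOutputStream out = new ZipOutputStream(bytes, charset);
        out.putNextEntry(new ZipEntry(entryName));
        out.write(new byte[] {1});
        out.closeEntry();
        out.close();
        return bytes.toByteArray();
    }

    // Read the first entry name back, decoding it with the given charset.
    static String firstEntryName(byte[] zip, Charset charset) throws IOException {
        ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip), charset);
        try {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset charset = StandardCharsets.ISO_8859_1; // assumed producer encoding
        byte[] zip = writeZip("abc\u00FCabc.txt", charset);
        System.out.println(firstEntryName(zip, charset));
    }
}
```

Note that this only fixes names whose bytes were written in a different encoding; it does not recompose a decomposed u + combining diaeresis.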
Upvotes: 0
Reputation: 140234
That's not a ¨ (U+00A8 DIAERESIS), but U+0308 COMBINING DIAERESIS.
The character is split this way because Mac OS X stores file names in Normalization Form D (NFD), which decomposes characters like this.
You can compose it back like so:
String name = zipEntry.getName();
name = Normalizer.normalize(name, Form.NFC);
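If you would rather end up with the ASCII fallback ue mentioned in the question, one option is to compose first and then replace the composed character. This is a rough sketch: UmlautTransliterator and toAscii are hypothetical names, and only ü is mapped here.

```java
import java.text.Normalizer;

final class UmlautTransliterator {

    static String toAscii(String name) {
        // Compose first, so that 'u' + U+0308 and a precomposed U+00FC
        // are both handled by the same replacement.
        String composed = Normalizer.normalize(name, Normalizer.Form.NFC);
        // Only U+00FC is mapped in this sketch; other letters would need their own entries.
        return composed.replace("\u00FC", "ue");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("abcu\u0308abc.txt")); // abcueabc.txt
    }
}
```

String.replace works on literal text, so there is no regex escaping to worry about as there is with replaceAll.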
More about normalization forms
The difference between the two diaereses is whether they modify the preceding base character:
System.out.println( "u" + (char)0xA8); //u¨
System.out.println( "u" + (char)0x0308); //ü
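Since the two look nearly identical on screen, it can help to print the code units instead of the glyphs. A small sketch (CodePointDump and dump are names I made up for this):

```java
final class CodePointDump {

    // Render each UTF-16 code unit as U+XXXX so that look-alike characters can be told apart.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("u\u00A8")); // U+0075 U+00A8
        System.out.println(dump("u\u0308")); // U+0075 U+0308
        System.out.println(dump("\u00FC"));  // U+00FC
    }
}
```

Running dump on zipEntry.getName() should make it obvious which of the three forms you actually have.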
Upvotes: 3