Reputation: 565
I am unzipping file.zip, which contains the files (a, b, c), in Pentaho Kettle (File management -> Unzip file), and it works fine. But if I try to unzip a file.zip that contains the files (a, b, ж), for example, I get these errors:
2016/01/18 17:46:17 - cfgbuilder - Warning: The configuration parameter [org] is not supported by the default configuration builder for scheme: sftp
2016/01/18 17:46:17 - Unzip file - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : Could not unzip file [file:///D:/projects/loaders/loader_little_files/src.zip]. Exception : [MALFORMED]
2016/01/18 17:46:17 - Unzip file - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : java.lang.IllegalArgumentException: MALFORMED
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile.getZipEntry(ZipFile.java:566)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile.access$900(ZipFile.java:60)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:524)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:499)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:480)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.zip.ZipFileSystem.init(ZipFileSystem.java:91)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractVfsContainer.addComponent(AbstractVfsContainer.java:53)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractFileProvider.addFileSystem(AbstractFileProvider.java:103)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.createFileSystem(AbstractLayeredFileProvider.java:88)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.findFile(AbstractLayeredFileProvider.java:61)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:790)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:712)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.core.vfs.KettleVFS.getFileObject(KettleVFS.java:151)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.core.vfs.KettleVFS.getFileObject(KettleVFS.java:106)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.entries.unzip.JobEntryUnZip.unzipFile(JobEntryUnZip.java:618)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.entries.unzip.JobEntryUnZip.processOneFile(JobEntryUnZip.java:516)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.entries.unzip.JobEntryUnZip.execute(JobEntryUnZip.java:461)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:730)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:546)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.run(Job.java:435)
I created the "ж" file on Windows 7.
I also tried renaming the file to "ж" on Linux, but the result did not change.
How can I do this? Is there a hidden setting? Thanks!
Upvotes: 1
Views: 2336
Reputation: 91
Only one thing worked for me on Debian Jessie: install WinRAR under Wine and choose the file name encoding there.
Upvotes: 0
Reputation: 1842
This shows how to decompress a zip file that was created on Windows 8.1 with 7zip and whose file names contain Cyrillic symbols. The zip archive contains 3 such files.
Fortunately, all the needed libraries (Apache commons-compress and commons-io) are already in the PENTAHO_HOME/lib directory, so you don't have to add extra libraries to Kettle.
Here is the code for a "User Defined Java Class" step:
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Enumeration;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.apache.commons.io.IOUtils;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] r = getRow();
    // No more input rows: tell Kettle that this step is finished.
    if (r == null) {
        setOutputDone();
        return false;
    }
    r = createOutputRow(r, data.outputRowMeta.size());

    // Both values come from Kettle variables.
    String fname = getVariable("FNAME", null);
    String outDir = getVariable("OUT", null);
    System.out.println(fname + " " + outDir);
    try {
        File inputFile = new File(fname);
        // "cp866" is the encoding of the entry names; false means Unicode extra fields are not used.
        ZipFile zipFile = new ZipFile(inputFile, "cp866", false);
        try {
            Enumeration enumEntry = zipFile.getEntries();
            int i = 0;
            while (enumEntry.hasMoreElements()) {
                ZipArchiveEntry entry = (ZipArchiveEntry) enumEntry.nextElement();
                if (entry.isDirectory()) {
                    continue; // only file entries are extracted
                }
                String entryName = entry.getName();
                System.out.println(entryName);
                InputStream is = zipFile.getInputStream(entry);
                // The counter prefix keeps output names unique even if decoding fails.
                OutputStream os = new FileOutputStream(new File(outDir, Integer.valueOf(++i) + entryName));
                try {
                    IOUtils.copy(is, os);
                } finally {
                    IOUtils.closeQuietly(is);
                    IOUtils.closeQuietly(os);
                }
            }
        } finally {
            zipFile.close();
        }
    } catch (Exception exc) {
        System.out.println("Failed to unzip");
        exc.printStackTrace();
    }
    putRow(data.outputRowMeta, r);
    return true;
}
The important parts of the code are:
String fname = getVariable("FNAME", null);
String outDir = getVariable("OUT", null);
These mean that two variables must be available in the transformation:
FNAME - absolute path to the zip file,
OUT - directory to extract the files into.
In this line:
ZipFile zipFile = new ZipFile(inputFile, "cp866", false);
"cp866" is the encoding that 7zip used for the zip entry names (on Windows this was cp866, the OEM code page of a Russian locale). If you use another archiver, you might need to change the encoding. There are some notes on this in the "Recommendations for Interoperability" section of https://commons.apache.org/proper/commons-compress/zip.html. You can also write your own algorithm to identify the encoding, relying for example on a known part of the file names in the zip archive. In any case, this Kettle job/transformation will most probably process zip files from a single known source, so you just need to identify that source's encoding once and set it in the code.
This line:
Integer.valueOf(++i) + entryName)
Why is the file name generated with an integer prefix? If the wrong encoding is used, ZipFile decodes the zip entry names to something like [].txt (it cannot decode а.txt or ж.txt, so it replaces the symbols 'а' and 'ж' with '[]'). As a result, if the encoding is wrong and the Cyrillic file names all have the same length, each entry in the loop overwrites the same file, and in the end you get a single file named [].txt.
With the counter in the file name you guarantee that all files get different names, even if you are not able to decode the file names correctly:
1[].txt
2[].txt
3[].txt
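The overwrite problem described above can be reproduced without Kettle. The following is only a sketch using plain JDK string decoding, not commons-compress itself, so the exact replacement character may differ from what ZipFile produces. The class and method names are made up for illustration, and the assumption is that the names were stored as cp866 but are decoded as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class, not part of Kettle or commons-compress.
final class NameCollisionDemo {

    /**
     * Encodes each name as cp866 (as the archive would store it), decodes the
     * bytes with the wrong charset (UTF-8) and returns the distinct results.
     */
    static Set<String> decodeWrongly(String[] names, boolean withCounter) {
        Charset cp866 = Charset.forName("IBM866");
        Set<String> result = new LinkedHashSet<>();
        int i = 0;
        for (String name : names) {
            byte[] raw = name.getBytes(cp866);
            // Bytes that are not valid UTF-8 are replaced with U+FFFD.
            String decoded = new String(raw, StandardCharsets.UTF_8);
            result.add(withCounter ? (++i) + decoded : decoded);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three Cyrillic one-letter names, written as escapes to avoid
        // depending on the encoding of this source file.
        String[] names = {"\u0430.txt", "\u0431.txt", "\u0436.txt"};
        System.out.println(decodeWrongly(names, false));
        System.out.println(decodeWrongly(names, true));
    }
}
```

Without the counter, all three names collapse into one and the same string, so only one output file would survive. With the counter, three distinct names remain.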
Anyway, if you know the encoding exactly, just remove the counter from the code to get rid of the numbers in the file names.
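The idea of identifying the encoding from a known part of the file names could look roughly like this. It is a minimal sketch under some assumptions: it uses the JDK's java.util.zip.ZipFile (which accepts a Charset since Java 7) instead of commons-compress, the class and method names are hypothetical, and the list of candidate encodings is something you have to supply yourself. The writeSample helper exists only so that the example can be tried without a real archive.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of Kettle or commons-compress.
final class ZipNameCharsetGuesser {

    /**
     * Returns the first candidate charset under which some entry name
     * contains knownFragment, or null if no candidate matches.
     */
    static Charset guess(File zip, String knownFragment, String[] candidates) {
        for (String candidate : candidates) {
            Charset cs = Charset.forName(candidate);
            // Depending on the JDK version, undecodable names fail either when
            // the archive is opened or while iterating, so both are in the try.
            try (ZipFile zf = new ZipFile(zip, cs)) {
                Enumeration<? extends ZipEntry> entries = zf.entries();
                while (entries.hasMoreElements()) {
                    if (entries.nextElement().getName().contains(knownFragment)) {
                        return cs;
                    }
                }
            } catch (IOException | IllegalArgumentException e) {
                // Treated as "this charset does not fit"; note that a real
                // I/O error also ends up here in this simplified sketch.
            }
        }
        return null;
    }

    /** Demo helper: writes a zip whose single entry name is stored in the given charset. */
    static File writeSample(String entryName, String charsetName) {
        try {
            File zip = File.createTempFile("sample", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(
                    new FileOutputStream(zip), Charset.forName(charsetName))) {
                out.putNextEntry(new ZipEntry(entryName));
                out.closeEntry();
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical entry name: a Cyrillic word followed by "_1.txt".
        File zip = writeSample("\u043e\u0442\u0447\u0435\u0442_1.txt", "IBM866");
        String[] candidates = {"UTF-8", "windows-1251", "IBM866"};
        System.out.println(guess(zip, "\u043e\u0442\u0447\u0435\u0442", candidates));
    }
}
```

The result could then be passed as the encoding argument when opening the archive. This only works if you really know some part of the expected names in advance. A short or pure-ASCII fragment would match under almost any candidate.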
Upvotes: 1
Reputation: 1842
Non-UTF-8 encoding in zip files.
Taken from here: https://blogs.oracle.com/xuemingshen/entry/non_utf_8_encoding_in
Important parts:
The Windows NTFS filesystem stores file names in UTF-16. Cyrillic symbols in file names cause problems in Java applications. Trouble arises when you create a zip archive with a third-party tool (unless you use a Java-based tool, which is rare) and then unzip it with a Java tool such as PDI.
Good news for Linux users: ext4 effectively uses UTF-8 by default. (Strictly speaking, it does not rely on any encoding and stores just a byte sequence, but the environment where you create files, whether a shell or a GUI such as GNOME's Nautilus file manager, assumes UTF-8 when it encodes the symbols of a file name to write it to disk.) Qt relies on the locale. Of course there are ways to override this, but as far as I know UTF-8 has become the widely used default locale encoding.
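To see the effect of the name encoding with the JDK alone: since Java 7, java.util.zip.ZipFile and ZipOutputStream have constructors that take a Charset. The sketch below writes entry names in a given encoding and reads them back. The class and method names are invented for this example, and it assumes that the IBM866 charset is available in your JDK.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not taken from the linked post.
final class ZipCharsetRoundTrip {

    /** Writes a temporary zip with empty entries whose names are encoded in the given charset. */
    static File writeZip(String[] entryNames, String charsetName) {
        try {
            File zip = File.createTempFile("names", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(
                    new FileOutputStream(zip), Charset.forName(charsetName))) {
                for (String name : entryNames) {
                    out.putNextEntry(new ZipEntry(name));
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists the entry names of the zip, decoding them with the given charset. */
    static List<String> listNames(File zip, String charsetName) {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip, Charset.forName(charsetName))) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // "a.txt" plus a one-letter Cyrillic name, written as an escape.
        File zip = writeZip(new String[] {"a.txt", "\u0436.txt"}, "IBM866");
        System.out.println(listNames(zip, "IBM866"));
    }
}
```

Reading the same archive with the default UTF-8 decoding instead is the situation from the question: the Cyrillic name is not valid UTF-8, and how the failure surfaces depends on the JDK version.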
Conclusion:
Upvotes: 3