Reputation: 49
I have a gzip-compressed tar archive on Google Cloud Storage that contains multiple text files. I need to access each subfile and run some operations on it, such as regular-expression matching. On my local machine I can do this as follows:
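import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public static void untarFile(String filepath) throws IOException {
    // try-with-resources closes the whole stream chain, even on failure
    try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
            new GzipCompressorInputStream(
                    new BufferedInputStream(new FileInputStream(filepath))))) {
        TarArchiveEntry entry;
        while ((entry = tarInput.getNextTarEntry()) != null) {
            if (entry.isDirectory()) {
                continue; // directories carry no content
            }
            byte[] fileContent = new byte[(int) entry.getSize()];
            // A single read() may return fewer bytes than requested,
            // so keep reading until the buffer is full.
            IOUtils.readFully(tarInput, fileContent);
        }
    }
}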
This gives me fileContent as a byte[] that I can run further operations on. On Google Cloud Dataflow I tried CompressedSource, following its test code, but it seems I can only read the file one byte at a time rather than getting the whole byte[] of each subfile. Is there a way to do this on Google Cloud Dataflow?
Upvotes: 1
Views: 75
Reputation: 3010
TextIO does not support this directly, but you can create a new subclass of FileBasedSource to do it. Override isSplittable() to always return false, and have the reader's readNextRecord() read the entire file as a single record.
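A minimal sketch of the "read the entire file" step, using only the JDK. The names WholeFileReader and readFully are hypothetical and not part of the Dataflow SDK. The assumption is that the custom reader is handed the file as a ReadableByteChannel and that readNextRecord() would return the resulting byte[] as the one record for that file; the Dataflow wiring itself is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

/** Hypothetical helper: drains a channel into a single byte[]. */
final class WholeFileReader {
    private WholeFileReader() {}

    static byte[] readFully(ReadableByteChannel channel) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
        // read() returns -1 at end of stream and may fill the buffer only partially.
        while (channel.read(buffer) != -1) {
            buffer.flip();
            out.write(buffer.array(), 0, buffer.limit());
            buffer.clear();
        }
        return out.toByteArray();
    }
}
```

The returned bytes could then be wrapped in a ByteArrayInputStream and passed to the same GzipCompressorInputStream and TarArchiveInputStream chain used locally. Note that this holds the whole file in memory, so it only suits archives that fit comfortably on a single worker.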
Upvotes: 1