蔡岳霖

Reputation: 49

How to access every entry of a CompressedSource in Google Cloud Dataflow and get the byte[] of each subfile?

I have a compressed file on Google Cloud Storage: a gzip-compressed archive (tar.gz) composed of multiple text files. I need to access each subfile and run some operations on it, such as regular-expression matching. On my local computer I can do this as follows:

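import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public static void untarFile(String filepath) throws IOException {
  // try-with-resources closes the whole stream chain, even on failure
  try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
      new GzipCompressorInputStream(
          new BufferedInputStream(new FileInputStream(filepath))))) {
    TarArchiveEntry entry;
    while ((entry = tarInput.getNextTarEntry()) != null) {
      // Buffer sized to hold the complete content of the current subfile
      byte[] fileContent = new byte[(int) entry.getSize()];
      // A single read() may return fewer bytes, so read until the buffer is full
      IOUtils.readFully(tarInput, fileContent);
    }
  }
}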

This gives me fileContent, a byte[] holding a whole subfile, which I can then process further. On Google Cloud Dataflow I tried CompressedSource, following its test code, but it seems I can only read the file byte by byte rather than get the whole byte[] of each subfile. Is there a way to do this on Google Cloud Dataflow?

Upvotes: 1

Views: 75

Answers (1)

danielm

Reputation: 3010

TextIO does not support this directly, but you can create a new subclass of FileBasedSource to do this. You'll want to override isSplittable() to always return false, and then have readNextRecord() just read the entire file.
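
The reader half of this idea can be sketched with plain JDK classes. This is only an illustration under stated assumptions: WholeFileReader is a hypothetical stand-in, not a Dataflow SDK class, and its method names merely mirror the reader methods mentioned above. The exact SDK signatures, and the isSplittable() override on the source itself, are not shown and should be checked against the SDK version in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical stand-in for the reader of an unsplittable file source.
// It deliberately does not depend on the Dataflow SDK.
class WholeFileReader {
  private InputStream in;
  private byte[] current;
  private boolean done;

  // Called once with a channel positioned at the start of the file.
  void startReading(ReadableByteChannel channel) {
    in = Channels.newInputStream(channel);
  }

  // Returns true exactly once: the single record is the entire file.
  boolean readNextRecord() throws IOException {
    if (done) {
      return false;
    }
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buffer.write(chunk, 0, n);
    }
    current = buffer.toByteArray();
    done = true;
    return true;
  }

  // The record produced by the last successful readNextRecord() call.
  byte[] getCurrent() {
    return current;
  }
}
```

With one byte[] record per file, a downstream step could wrap the record in a ByteArrayInputStream and run the same TarArchiveInputStream loop as in the question to get the byte[] of each subfile. Whether the bytes arrive already decompressed depends on how the source is wrapped, so that part is an assumption to verify.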

Upvotes: 1
