Google Cloud Dataflow (Apache Beam) - Always fails to parse the first line of a gzip file

I have gzipped, line-delimited JSON files in GCS. I want to load them with Dataflow and then save them into BigQuery.

However, it always fails to parse the JSON on the first line of the file. I use Jackson, and the log says:

    Failed to parse JSON, com.fasterxml.jackson.core.JsonParseException:
    Unexpected character ('_' (code 95)):
    Expected space separating root-level values  at
    [Source: (String)"40085_telemetry_2015-09-09.log0000664000076600007660000011300712574251007016553
    0ustar  xxxadminxxxadmin"[truncated 142 chars];
    line: 1, column: 7

However, when I checked the file contents, there was no string like the one above, and the first line is definitely valid JSON. It looks as if the string above is prepended to the first line when Dataflow starts processing.

Do you have any idea why this happens? I use the Apache Beam Java SDK, version 2.1.0.

My code is shown below:

    static class ReadFile extends PTransform<PInput, PCollection<String>> {
        private static final long serialVersionUID = 1L;
        private ValueProvider<String> files;

        public ReadFile(ValueProvider<String> env, ValueProvider<String> productName, ValueProvider<String> files,
                ValueProvider<String> deadLetterId) {
            this.files = files;
        }

        @Override
        public PCollection<String> expand(PInput input) {
            Pipeline p = input.getPipeline();

            // this pipeline supports both wildcard and comma separated
            String inputFile = files.get();
            if (inputFile.contains("*")) {
                return p.apply("Read from GCS with wildcard prefix",
                        TextIO.read().from(inputFile).withCompressionType(CompressionType.GZIP));
            }

            String[] targetFiles = inputFile.split(",");
            PCollectionList<String> rowsList = PCollectionList.empty(p);
            for (String targetFile : targetFiles) {
                PCollection<String> fileLines = p.apply("Read (" + targetFile + ")",
                        TextIO.read().from(targetFile).withCompressionType(CompressionType.GZIP));
                rowsList = rowsList.and(fileLines);
            }

            PCollection<String> allRows = rowsList.apply("Flatten rows", Flatten.<String>pCollections());
            return allRows;
        }
    }

Upvotes: 0

Answers (2)

SaSConsul

Reputation: 298

I built a gzip reader in Java to avoid the gzip -d step. This has the advantage of saving disk space and processing time.
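
The reader itself is not shown in this answer, so the following is only a minimal sketch of what such a reader could look like with the JDK's `GZIPInputStream`. The class name `GzipLineReader` and the method `readLines` are hypothetical, and UTF-8 input is assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: reads the lines of a gzip stream without writing
// a decompressed copy to disk first.
final class GzipLineReader {
    private GzipLineReader() {}

    // Decompresses on the fly and returns every line (assumes UTF-8 text).
    static List<String> readLines(InputStream gzipped) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

For large files it would probably be better to process each line as it is read instead of collecting them all in a list.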

Upvotes: 0

Norio Akagi

Reputation: 725

I just found that the problem is in our file. When I run gzip -d locally, the weird string actually exists at the beginning of the file.
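
The prefix in the log (a file name, a run of octal digits, then `ustar`) resembles a tar header, which would suggest the object is a gzipped tar archive rather than a plain gzipped text file. That is an inference from the log, not something confirmed here. A rough way to check, assuming the standard tar layout in which the `ustar` magic sits at byte offset 257 of the first 512-byte block, is sketched below; the class `TarInGzipCheck` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: checks whether gzip-compressed data decompresses to
// something that starts with a tar header.
final class TarInGzipCheck {
    private static final int TAR_BLOCK_SIZE = 512; // size of one tar header block
    private static final int MAGIC_OFFSET = 257;   // offset of the "ustar" magic
    private static final String MAGIC = "ustar";

    private TarInGzipCheck() {}

    // True if the first `length` bytes of `block` carry the tar magic.
    static boolean hasTarMagic(byte[] block, int length) {
        if (length < MAGIC_OFFSET + MAGIC.length()) {
            return false;
        }
        String found = new String(block, MAGIC_OFFSET, MAGIC.length(), StandardCharsets.US_ASCII);
        return MAGIC.equals(found);
    }

    // Decompresses up to one tar block from the gzip stream and inspects it.
    static boolean looksLikeTar(InputStream gzipped) throws IOException {
        byte[] block = new byte[TAR_BLOCK_SIZE];
        int filled = 0;
        try (InputStream in = new GZIPInputStream(gzipped)) {
            int n;
            while (filled < block.length
                    && (n = in.read(block, filled, block.length - filled)) > 0) {
                filled += n;
            }
        }
        return hasTarMagic(block, filled);
    }
}
```

If the check returns true, extracting the archive and re-compressing the log with plain gzip before uploading would likely remove the prefix at the source.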

Somehow Mac's Archive Utility omits that part, so I didn't notice it. For now, I can hard-code my pipeline to strip everything before the first "{" to work around the issue.
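
A minimal sketch of that workaround, assuming every record is a JSON object that starts with "{". The class `JsonLinePrefix` and the method `stripToFirstBrace` are hypothetical names, not part of the pipeline shown in the question.

```java
// Hypothetical helper: drops any junk that precedes the first '{' on a line.
final class JsonLinePrefix {
    private JsonLinePrefix() {}

    // Returns the line starting at its first '{'. A line that already starts
    // with '{', or that contains no '{' at all, is returned unchanged.
    static String stripToFirstBrace(String line) {
        int start = line.indexOf('{');
        return start > 0 ? line.substring(start) : line;
    }
}
```

This could be applied to each line in a step placed before the Jackson parsing. Lines with no "{" pass through unchanged and would still fail to parse, so they may need to be filtered or dead-lettered separately.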

Upvotes: 1