Ingo

Reputation: 1572

How to report invalid data while processing data with Google dataflow?

I am looking at the documentation and the provided examples to find out how I can report invalid data while processing data with Google's dataflow service.

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
 .apply(new SomeTransformation())
 .apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
p.run();

In addition to the actual input and output, I want to produce a second output file that contains records that are considered invalid (e.g. missing data, malformed data, values that are too high). I want to troubleshoot those records and process them separately.

How can I redirect those invalid records into a separate output?

Upvotes: 2

Views: 899

Answers (2)

Frances

Reputation: 4021

Robert's suggestion of using sideOutputs is great, but note that this only works if the bad data is identified by your ParDos. There currently isn't a way to identify bad records encountered during initial decoding (where the error occurs in Coder.decode). We've got plans to work on that soon.
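One possible workaround in the meantime, sketched below under the assumption that the input is line-delimited text: read the raw lines as strings and do the decoding yourself inside a ParDo, so a parse failure can be caught and routed to a side output instead of failing the read. MyRecord and parseRecord are hypothetical stand-ins for your own record type and parser.

final TupleTag<MyRecord> parsedTag = new TupleTag<>("parsed");
final TupleTag<String> unparsableTag = new TupleTag<>("unparsable");

PCollectionTuple parsed = p
    .apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
    .apply(ParDo
        .of(new DoFn<String, MyRecord>() {
              @Override
              public void processElement(ProcessContext c) {
                try {
                  // Hypothetical parser that throws on malformed input.
                  c.output(parseRecord(c.element()));
                } catch (Exception e) {
                  // Route the raw line to a side output for later inspection.
                  c.sideOutput(unparsableTag, c.element());
                }
              }
            })
        .withOutputTags(parsedTag, TupleTagList.of(unparsableTag)));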

Upvotes: 0

robertwb

Reputation: 5104

You can use PCollectionTuples to return multiple PCollections from a single transform. For example,

TupleTag<String> mainOutput = new TupleTag<>("main");
TupleTag<String> missingData = new TupleTag<>("missing");
TupleTag<String> badValues = new TupleTag<>("bad");

Pipeline p = Pipeline.create(options);
PCollectionTuple all = p
   .apply(TextIO.Read.named("ReadMyFile").from(options.getInput()))
   .apply(new SomeTransformation());

all.get(mainOutput)
   .apply(TextIO.Write.named("WriteMyFile").to(options.getOutput()));
all.get(missingData)
   .apply(TextIO.Write.named("WriteMissingData").to(...));
...

PCollectionTuples can either be built up directly out of existing PCollections, or emitted from ParDo operations with side outputs, e.g.

PCollectionTuple partitioned = input.apply(ParDo
    .of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
             if (checkOK(c.element())) {
                 // Shows up in partitioned.get(mainOutput).
                 c.output(...);
             } else if (hasMissingData(c.element())) {
                 // Shows up in partitioned.get(missingData).
                 c.sideOutput(missingData, c.element());
             } else {
                 // Shows up in partitioned.get(badValues).
                 c.sideOutput(badValues, c.element());
             }
          }
        })
    .withOutputTags(mainOutput, TupleTagList.of(missingData).and(badValues)));

Note that in general the various side outputs need not have the same type, and data can be emitted any number of times to any number of side outputs (rather than the strict partitioning we have here).
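As an illustration, here is a fragment (a variant of the ParDo above) where the bad-values side output has a different element type than the main output, pairing each offending element with a hypothetical reason string:

final TupleTag<String> mainOutput = new TupleTag<>("main");
final TupleTag<KV<String, String>> badValues = new TupleTag<>("bad");
...
} else {
    // This side output carries KV<String, String> rather than String:
    // the offending element together with a reason.
    c.sideOutput(badValues, KV.of(c.element(), "value out of range"));
}
...
PCollection<KV<String, String>> bad = partitioned.get(badValues);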

Your SomeTransformation class could then look something like

class SomeTransformation extends PTransform<PCollection<String>,
                                            PCollectionTuple> {
  public PCollectionTuple apply(PCollection<String> input) {
    // Filter into good and bad data.
    PCollectionTuple partitioned = ...
    // Process the good data.
    PCollection<String> processed =
        partitioned.get(mainOutput)
                   .apply(...)
                   .apply(...)
                   ...;
    // Repackage everything into a new output tuple.
    return PCollectionTuple.of(mainOutput, processed)
                           .and(missingData, partitioned.get(missingData))
                           .and(badValues, partitioned.get(badValues));
  }
}

Upvotes: 3
