Issues with Dynamic Destinations in Dataflow

Question

I have a Dataflow job that reads data from pubsub and based on the time and filename writes the contents to GCS where the folder path is based on the YYYY/MM/DD. This allows files to be generated in folders based on date and uses apache beam's FileIO and Dynamic Destinations.

About two weeks ago, I noticed an unusual buildup of unacknowledged messages. Upon restarting the df job the errors disappeared and new files were being written in GCS.

After a couple of days, writing stopped again, except this time, there were errors claiming that processing was stuck. After some trusty SO research, I found out that this was likely caused by a deadlock issue in pre 2.90 Beam because it used the Conscrypt library as the default security provider. So, I upgraded to Beam 2.11 from Beam 2.8.

Once again, it worked, until it didn't. I looked more closely at the error and noticed that it had a problem with a SimpleDateFormat object, which isn't thread-safe. So, I switched to use Java.time and DateTimeFormatter, which is thread-safe. It worked until it didn't. However, this time, the error was slightly different and didn't point to anything in my code: The error is provided below.

Processing stuck in step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least 05m00s without outputting or completing in state process
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:469)
  at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
  at org.apache.beam.runners.dataflow.worker.MetricTrackingWindmillServerStub.getStateData(MetricTrackingWindmillServerStub.java:202)
  at org.apache.beam.runners.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:409)
  at org.apache.beam.runners.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:311)
  at org.apache.beam.runners.dataflow.worker.WindmillStateReader$BagPagingIterable$1.computeNext(WindmillStateReader.java:700)
  at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
  at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
  at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.MultitransformedIterator.hasNext(MultitransformedIterator.java:47)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:701)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)

This error started occurring approximately 5 hours after job deployment and at an increasing rate over time. Writing slowed significantly within 24 hours. I have 60 workers and I suspect that one worker fails every time there is an error, which eventually kills the job.

In my writer, I parse the lines for certain keywords (may not be the best way) in order to determine which folder it belongs in. I then proceed to insert the file to GCS with the determined filename. Here is the code I use for my writer:

The partition function is provided as the following:

@SuppressWarnings("serial")
public static class datePartition implements SerializableFunction {     
    private String filename;

    public datePartition(String filename) {
        this.filename = filename;
    }

    @Override
    public String apply(String input) {

        String folder_name = "NaN";             
        String date_dtf    = "NaN";     
        String date_literal = "NaN";
        try {
            Matcher foldernames = Pattern.compile("\"foldername\":\"(.*?)\"").matcher(input);
            if(foldernames.find()) {
                folder_name = foldernames.group(1);
            }
            else {
                Matcher folderid = Pattern.compile("\"folderid\":\"(.*?)\"").matcher(input);
                if(folderid.find()) {
                    folder_name = folderid.group(1);
                }   
            }

            Matcher date_long = Pattern.compile("\"timestamp\":\"(.*?)\"").matcher(input);
            if(date_long.find()) {
                date_literal = date_long.group(1);
                if(Utilities.isNumeric(date_literal)) {
                    LocalDateTime date = LocalDateTime.ofInstant(Instant.ofEpochMilli(Long.valueOf(date_literal)), ZoneId.systemDefault());
                    date_dtf = date.format(dtf);                        
                }
                else {
                    date_dtf = date_literal.split(":")[0].replace("-", "/").replace("T", "/");
                }
            }
            return folder_name + "/" + date_dtf + "h/" + filename;
        }

        catch(Exception e) {
            LOG.error("ERROR with either foldername or date");
            LOG.error("Line : " + input);
            LOG.error("folder : " + folder_name);
            LOG.error("Date : " + date_dtf);

            return folder_name + "/" + date_dtf + "h/" + filename;
        }           
    }
}

And the actual place where the pipeline is deployed and run can be found below:

public void streamData() {

    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
            .apply(options.getWindowDuration() + " Window",
                        Window.into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                                  .triggering(AfterWatermark.pastEndOfWindow()) 
                                  .discardingFiredPanes()
                                  .withAllowedLateness(parseDuration("24h")))
                .apply(new GenericFunctions.extractMsg())
                .apply(FileIO.writeDynamic()
                                 .by(new datePartition(options.getOutputFilenamePrefix()))
                                 .via(TextIO.sink())
                                 .withNumShards(options.getNumShards())
                                 .to(options.getOutputDirectory())
                                 .withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))
                                 .withDestinationCoder(StringUtf8Coder.of()));

    pipeline.run();
}

Scicrazed · Accepted Answer

Since posting this question, I've optimized the dataflow job to dodge bottlenecks and increase parallelization. Much like rsantiago explained, processing stuck isn't an error, but simply a way dataflow communicates that a step is taking significantly longer than other steps, which is essentially a bottleneck that can't be cleared with the given resources. The changes I made seem to have addressed them. The new code is as follows:

public void streamData() {

        try {
            Pipeline pipeline = Pipeline.create(options);

            pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
            .apply(options.getWindowDuration() + " Window",
                    Window.into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                          .triggering(AfterWatermark.pastEndOfWindow()) 
                          .discardingFiredPanes()
                          .withAllowedLateness(parseDuration("24h")))
            .apply(FileIO.writeDynamic()
                    .by(new datePartition(options.getOutputFilenamePrefix()))
                    .via(Contextful.fn(
                            (SerializableFunction) inputMsg -> new String(inputMsg.getPayload(), StandardCharsets.UTF_8)),
                            TextIO.sink())
                    .withDestinationCoder(StringUtf8Coder.of())
                    .to(options.getOutputDirectory())
                    .withNaming(type -> new CrowdStrikeFileNaming(type))
                    .withNumShards(options.getNumShards())
                    .withTempDirectory(options.getTempLocation()));

            pipeline.run();
        }

        catch(Exception e) {

            LOG.error("Unable to deploy pipeline");
            LOG.error(e.toString(), e);
        }

    }

The biggest change involved removing the extractMsg() function and changing partitioning to only use metadata. Both of these steps forced deserialization/reserialization of messages and heavily impacted performance.

Additionally, since my data set was unbounded, I had to set a non-zero number of shards. I wanted to simplify my filenaming policy, so I set it to 1 without knowing how much it hurt performance. Since then, I've found a good balance of workers/shards/machine type for my job (mostly based on guess & check, unfortunately).

Although it's still possible that a bottleneck might be observed with a large enough data load, the pipeline has been performing well despite heavy load (3-5tb per day). The changes also significantly improved autoscaling, but I'm not sure why. The dataflow job now reacts to spikes and valleys a lot quicker.

Issues with Dynamic Destinations in Dataflow

Answers (2)

Related Questions