Side input in global window as slowly changing cache questions

Question

Context: We have some schema files in Cloud Storage. In our Dataflow job, we need to refer to these schema files to transform our data. The schema files change on a daily or weekly basis. Our data source is PubSub, and we window the PubSub messages into fixed windows of 1 minute. The schema files are about 90 MB in total, so they fit comfortably in memory.

What I have tried: Referring to this doc from Apache Beam, we created a side input that writes into a global window with a GenerateSequence like so:

    // Creates a side input that re-reads the schema files each time the
    // sequence emits an element (one element per day at this rate).
    PCollectionView<Map<String, byte[]>> dataBlobView =
        pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
            .apply(Window.<Long>into(new GlobalWindows()).triggering(
                Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
                .discardingFiredPanes())
            .apply(ParDo.of(new DoFn<Long, Map<String, byte[]>>() {
              @ProcessElement
              public void processElement(ProcessContext ctx) throws Exception {
                byte[] avroSchemaBlob = getAvroSchema();
                byte[] fileDescriptorSetBlob = getFileDescriptorSet();
                byte[] depsBlob = getFileDescriptorDeps();
                Map<String, byte[]> dataBlobs = ImmutableMap.of(
                    // Use the full sequence number as the version; byteValue()
                    // would keep only its low 8 bits.
                    "version", Longs.toByteArray(ctx.element()),
                    "avroSchemaBlob", avroSchemaBlob,
                    "fileDescriptorSetBlob", fileDescriptorSetBlob,
                    "depsBlob", depsBlob);
                ctx.output(dataBlobs);
              }
            }))
            .apply(View.asSingleton());

"getAvroSchema", "getFileDescriptorSet" and "getFileDescriptorDeps" read files as byte[] from Cloud Storage.

However, this approach failed with the following exception:

org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: PCollection with more than one element accessed as a singleton view.

I then tried writing my own function for a global Combine, like so:

    // Picks the blob map with the highest "version" entry.
    static class GetLatestVersion
        implements SerializableFunction<Iterable<Map<String, byte[]>>, Map<String, byte[]>> {
      @Override
      public Map<String, byte[]> apply(Iterable<Map<String, byte[]>> versions) {
        Map<String, byte[]> result = Maps.newHashMap();
        long maxVersion = Long.MIN_VALUE;
        for (Map<String, byte[]> version : versions) {
          long currentVersion = Longs.fromByteArray(version.get("version"));
          logger.info("Side input version: " + currentVersion);
          if (currentVersion > maxVersion) {
            result = version;
            maxVersion = currentVersion;
          }
        }
        return result;
      }
    }

But it still triggers the same exception.

I then came across this and this thread in the Beam email archives, and it seems that what the Beam doc suggests does not work and that I have to use a MultiMap to avoid the exception above. With a MultiMap, I would also have to iterate through the values and apply my own logic to pick the value I want (the latest).
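To make the "pick the latest" step concrete, here is a minimal sketch in plain Java, with no Beam dependency, of the selection logic I would have to run over the values returned for a key. The class name LatestVersionPicker and the blob helper are made up for illustration, and the version is assumed to be an 8-byte big-endian long, which is the layout Guava's Longs.toByteArray produces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Beam or of the job described above.
final class LatestVersionPicker {

  // Encodes a long as 8 big-endian bytes (same layout as Guava's Longs.toByteArray).
  static byte[] encodeVersion(long version) {
    return ByteBuffer.allocate(Long.BYTES).putLong(version).array();
  }

  // Decodes 8 big-endian bytes back into a long.
  static long decodeVersion(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getLong();
  }

  // Returns the blob map with the highest "version" entry, or null if there are none.
  static Map<String, byte[]> pickLatest(Iterable<Map<String, byte[]>> candidates) {
    Map<String, byte[]> latest = null;
    long maxVersion = Long.MIN_VALUE;
    for (Map<String, byte[]> candidate : candidates) {
      long version = decodeVersion(candidate.get("version"));
      if (latest == null || version > maxVersion) {
        latest = candidate;
        maxVersion = version;
      }
    }
    return latest;
  }

  // Builds a small stand-in for one day's data blobs (illustration only).
  static Map<String, byte[]> blob(long version, String payload) {
    Map<String, byte[]> blobs = new HashMap<>();
    blobs.put("version", encodeVersion(version));
    blobs.put("avroSchemaBlob", payload.getBytes(StandardCharsets.UTF_8));
    return blobs;
  }

  public static void main(String[] args) {
    // Values arrive in no particular order, as they might from a multimap lookup.
    List<Map<String, byte[]>> values =
        Arrays.asList(blob(0, "day-0"), blob(2, "day-2"), blob(1, "day-1"));
    Map<String, byte[]> latest = pickLatest(values);
    System.out.println(decodeVersion(latest.get("version"))); // prints 2
  }
}
```

Inside the main transform, something like pickLatest would be called on the Iterable of values read from the side input. The sketch only covers the selection; it says nothing about how many old values the runner keeps around.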

My questions:

1. Why do I still get the exception "PCollection with more than one element accessed as a singleton view" even after I globally combine everything into one result?
2. If I go with the MultiMap approach, wouldn't the job eventually run out of memory? Every day we would be growing the MultiMap by about 90 MB (the size of our data blobs), unless Dataflow has some smart MultiMap implementation behind the scenes.
3. What is the recommended way to do this?

Thanks
