Kyle Durnam

Reputation: 87

Apache Beam WithTimestamps: Output timestamps must be no earlier than timestamp of current input

I am trying to window data from a Google Cloud Pub/Sub stream into fixed 10-second windows, but I get this error:

java.lang.IllegalArgumentException: Cannot output with timestamp 2019-07-20T12:13:04.875Z. Output timestamps must be no earlier than the timestamp of the current input (2019-07-20T12:13:05.591Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
  org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.checkTimestamp(SimpleDoFnRunner.java:587)
  org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.outputWithTimestamp(SimpleDoFnRunner.java:566)
  org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.outputWithTimestamp(DoFnOutputReceivers.java:80)
  org.apache.beam.sdk.transforms.WithTimestamps$AddTimestampsDoFn.processElement(WithTimestamps.java:136)

Here is the code that causes the error:

eventStream
  .apply("Add Event Timestamps",
    WithTimestamps.of((Event event) -> new Instant(event.getTime())))
  .apply("Window Events",
    Window.<Event>into(FixedWindows.of(Duration.parseDuration("10s"))));

What causes this, and what is a suitable solution?

Upvotes: 2

Views: 2768

Answers (1)

Brachi

Reputation: 737

From the WithTimestamps documentation:

If the input PCollection elements have timestamps, the output timestamp for each element must not be before the input element's timestamp minus the value of getAllowedTimestampSkew(). If an output timestamp is before this time, the transform will throw an IllegalArgumentException when executed. Use withAllowedTimestampSkew(Duration) to update the allowed skew.

CAUTION: Use of withAllowedTimestampSkew(Duration) permits elements to be emitted behind the watermark. These elements are considered late, and if behind the allowed lateness (Window#withAllowedLateness(Duration)) of a downstream PCollection may be silently dropped.

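So the error means that the timestamp you assign from event.getTime() is earlier than the timestamp the element already carries. For elements read from Pub/Sub, that existing timestamp is the publish time by default, and an event is normally created before it is published. One way to fix it is to raise the allowed skew with withAllowedTimestampSkew, keeping in mind the caution above about late data being dropped.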
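To make the rule concrete, here is a minimal plain-Java sketch of the check described in the documentation, applied to the two timestamps from the error message. This is not Beam's code: the class and method names (TimestampSkewCheck, isAllowed, minimumSkew) are made up for illustration, and it uses java.time rather than the Joda-Time types that Beam uses.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Beam: it restates the documented rule
// "output timestamp must not be before (input timestamp - allowed skew)".
class TimestampSkewCheck {

    // True when the output timestamp would be accepted for the given skew.
    static boolean isAllowed(Instant input, Instant output, Duration allowedSkew) {
        return !output.isBefore(input.minus(allowedSkew));
    }

    // Smallest skew that would make this particular output timestamp acceptable.
    static Duration minimumSkew(Instant input, Instant output) {
        Duration gap = Duration.between(output, input);
        return gap.isNegative() ? Duration.ZERO : gap;
    }

    public static void main(String[] args) {
        // Timestamps copied from the exception in the question.
        Instant input = Instant.parse("2019-07-20T12:13:05.591Z");
        Instant output = Instant.parse("2019-07-20T12:13:04.875Z");

        System.out.println(isAllowed(input, output, Duration.ZERO));        // false
        System.out.println(minimumSkew(input, output).toMillis());          // 716
        System.out.println(isAllowed(input, output, Duration.ofMillis(716))); // true
    }
}
```

For this one element a skew of 716 ms would have been enough, but the gap between event time and publish time varies per message, so any fixed skew is a guess about the worst case.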

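I used a different API: withTimestampAttribute. It tells PubsubIO to take each element's timestamp from a named Pub/Sub message attribute (not a field inside the JSON/Avro payload) instead of the publish time. The attribute value must be either milliseconds since the Unix epoch or an RFC 3339 timestamp string.

This API is available when publishing: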

  .apply(PubsubIO.writeAvros(Someclass.class)
         .withIdAttribute("id")
         .withTimestampAttribute("myTime").to(topic));

And when subscribing:

  .apply(PubsubIO.readAvros(Someclass.class)
         .fromSubscription(...)
         .withIdAttribute("id")
         .withTimestampAttribute("myTime"))

Upvotes: 1
