Reputation: 67
I'm practicing for the Data Engineer GCP certification exam and got the following question:
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update.
What should you do?
Possible answers:
- Update the current pipeline and use the drain flag.
- Update the current pipeline and provide the transform mapping JSON object.
The correct answer according to the website is 1; my answer was 2. I'm not convinced my answer is incorrect, and here is my reasoning:
The only way I can see 1 as the correct answer is if you don't care about compatibility.
So which one is right?
Upvotes: 2
Views: 1735
Reputation: 56
I'm studying for the same exam, and this question hinges on two points:
1- Don't lose data ← drain is ideal here: the pipeline processes all buffered data and stops receiving new messages. Unacknowledged Pub/Sub messages are normally retained for up to 7 days, so when you start the new job it receives all of them without losing any data (sketched below).
2- Incompatible new code ← transform mapping solves some incompatibilities, such as a renamed ParDo, but not a version issue, so launching a new job with the new code is the only option.
So option 1 (drain) is correct.
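For reference, here is a minimal sketch of that drain-then-relaunch flow using the gcloud CLI; the job ID, project, region, and pipeline script are placeholders, not values from the question:

```sh
# Ask Dataflow to drain the running streaming job: it stops pulling from
# the Pub/Sub subscription, finishes in-flight work, then shuts down.
gcloud dataflow jobs drain JOB_ID --region=us-central1

# Poll until the job reaches the "Drained" state.
gcloud dataflow jobs describe JOB_ID --region=us-central1 \
  --format='value(currentState)'

# Relaunch the pipeline with the new (incompatible) code against the same
# subscription; messages retained by Pub/Sub are delivered to the new job.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=MY_PROJECT \
  --region=us-central1 \
  --streaming
```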
Upvotes: 1
Reputation: 1428
I think the main point is that you cannot solve all incompatibilities with the transform mapping. Mapping works for simple pipeline changes (for example, renamed transforms), but it doesn't generalize well.
The recommended solution is to drain the pipeline running the legacy version: it stops consuming data from the reading components, finishes all work pending on the workers, and shuts down. When you start a new pipeline, you don't have to worry about state compatibility, because the workers start fresh.
However, the question is indeed ambiguous; it should be more precise about the type of incompatibility, or state that it holds in general. Arguably, you can always try to update the job with a transform mapping first: if Dataflow finds the new job incompatible, the running pipeline is unaffected, and your only remaining choice is the drain option.
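As a sketch of that "try the update first" path (Beam Python SDK; the script, job name, and transform names are hypothetical placeholders), you would relaunch the pipeline with the `--update` flag and a transform name mapping; if the compatibility check fails, Dataflow rejects the update and the original job keeps running untouched:

```sh
# Attempt an in-place update of the running job, telling Dataflow how
# renamed transforms in the new code map to transforms in the old job.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=MY_PROJECT \
  --region=us-central1 \
  --streaming \
  --update \
  --job_name=existing-job-name \
  --transform_name_mapping='{"OldTransformName": "NewTransformName"}'
```

If the update is rejected as incompatible, you fall back to the drain-and-relaunch flow from the other answer.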
Upvotes: 2