Reputation: 67
I'm practicing for the Data Engineer GCP certification exam and got the following question:
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update.
What should you do?
Possible answers:
- Update the current pipeline and use the drain flag.
- Update the current pipeline and provide the transform mapping JSON object.
The correct answer according to the website is 1; my answer was 2. I'm not convinced my answer is incorrect, and here is my reasoning:
The only way I can see 1 as the correct answer is if you don't care about compatibility.
So which one is right?
Upvotes: 2
Views: 1735
Reputation: 56
I'm studying for the same exam, and this question hinges on two points:
1- Don't lose data ← drain is ideal here: the pipeline processes all buffered data and stops receiving new messages. Unacknowledged Pub/Sub messages are normally retained for up to 7 days, so when you start the new job it receives all of them without losing any data (sketched below).
2- Incompatible new code ← transform mapping solves some incompatibilities, such as a renamed ParDo, but not a version issue, so launching a new job with the new code is the only option.
So option 1 (drain) is correct.
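For reference, here is a minimal sketch of that drain-then-relaunch flow using the gcloud CLI; the job ID, project, region, and pipeline script are placeholders, not values from the question:

```sh
# Ask Dataflow to drain the running streaming job: it stops pulling from
# the Pub/Sub subscription, finishes in-flight work, then shuts down.
gcloud dataflow jobs drain JOB_ID --region=us-central1

# Poll until the job reaches the "Drained" state.
gcloud dataflow jobs describe JOB_ID --region=us-central1 \
  --format='value(currentState)'

# Relaunch the pipeline with the new (incompatible) code against the same
# subscription; messages retained by Pub/Sub are delivered to the new job.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=MY_PROJECT \
  --region=us-central1 \
  --streaming
```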
Upvotes: 1
Reputation: 1428
I think the main point is that you cannot solve all incompatibilities with the transform mapping. Mapping works for simple pipeline changes (for example, renamed transforms), but it doesn't generalize well.
The recommended solution is to drain the pipeline running the legacy version: it stops consuming data from the reading components, finishes all work pending on the workers, and shuts down. When you start a new pipeline, you don't have to worry about state compatibility, because the workers start fresh.
However, the question is indeed ambiguous; it should be more precise about the type of incompatibility, or state that it holds in general. Arguably, you can always try to update the job with a transform mapping first: if Dataflow finds the new job incompatible, the running pipeline is unaffected, and your only remaining choice is the drain option.
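As a sketch of that "try the update first" path (Beam Python SDK; the script, job name, and transform names are hypothetical placeholders), you would relaunch the pipeline with the `--update` flag and a transform name mapping; if the compatibility check fails, Dataflow rejects the update and the original job keeps running untouched:

```sh
# Attempt an in-place update of the running job, telling Dataflow how
# renamed transforms in the new code map to transforms in the old job.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=MY_PROJECT \
  --region=us-central1 \
  --streaming \
  --update \
  --job_name=existing-job-name \
  --transform_name_mapping='{"OldTransformName": "NewTransformName"}'
```

If the update is rejected as incompatible, you fall back to the drain-and-relaunch flow from the other answer.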
Upvotes: 2