darkjh
darkjh

Reputation: 2861

How to extract contents from PCollection in Cloud Dataflow?

Just want to know how to extract things from PCollection? Say I have applied a Count.Globally so there's a single number in the resulting PCollection, but how can I extract it as a Long value?

Thanks.

Upvotes: 4

Views: 6514

Answers (2)

Jan Lukavsky
Jan Lukavsky

Reputation: 131

You must always think of PCollection as of a stream. The fact, that you applied a transform that creates single value per window does not guarantee there is really only single value. This depends on the windowing strategy - so there might be single value in the case you use GlobalWindow, but there will be many values for other types of window functions (e.g. sliding windows).

Therefore it is not possible to extract this single value directly (e.g. something like PCollection.get()) - the return value would have to be a stream. If you want to retrieve result from a PCollection, you must apply to it a transform, that will store it somewhere. There is a rich set of build-in IO modules (see here). If you want to retrieve the resulting value(s), and use it in a program later, the best option would be to store it in some shared database of your choice and retrieve this value after your Pipeline finishes. Note that this means your Pipeline is bounded (e.g. batch, not streaming), as otherwise it will never finish. But your question suggests that a bounded Pipeline is what you have in mind.

Upvotes: 1

Jeremy Lewi
Jeremy Lewi

Reputation: 6766

It depends on how you want to use that value.

If you want to read that value after your pipeline finishes you could use one of the write transforms (e.g. AvroIO.Write) to write it to some output that you could then read from whatever code executes after your pipeline finishes.

If you want to use that value in a subsequent part of your pipeline then you could apply a View transfrom to generate a PCollectionView which you could then pass as a side input to other transforms.

Consider a simple example where the goal is to print out the Count. The Count won't be available until after the pipeline runs. So in this case we could do the following

  • Define a DoFn<Long, String> which we apply to the count in order to turn the Long into the message we want to print out.
  • Apply a TextIO.Write transform to write the message to a file.
  • Run the job and wait for it to finish. If we want to execute using the Dataflow Service we can use BlockingDataflowRunner to wait for the job to finish.
  • After the job finishes read the text file created to get the message and print it out.

Upvotes: 3

Related Questions