Reputation: 2861
Just want to know how to extract things from PCollection? Say I have applied a Count.Globally so there's a single number in the resulting PCollection, but how can I extract it as a Long value?
Thanks.
Upvotes: 4
Views: 6514
Reputation: 131
You must always think of PCollection
as of a stream. The fact, that you applied a transform that creates single value per window does not guarantee there is really only single value. This depends on the windowing strategy - so there might be single value in the case you use GlobalWindow, but there will be many values for other types of window functions (e.g. sliding windows).
Therefore it is not possible to extract this single value directly (e.g. something like PCollection.get()
) - the return value would have to be a stream. If you want to retrieve result from a PCollection, you must apply to it a transform, that will store it somewhere. There is a rich set of build-in IO modules (see here). If you want to retrieve the resulting value(s), and use it in a program later, the best option would be to store it in some shared database of your choice and retrieve this value after your Pipeline finishes. Note that this means your Pipeline is bounded (e.g. batch, not streaming), as otherwise it will never finish. But your question suggests that a bounded Pipeline is what you have in mind.
Upvotes: 1
Reputation: 6766
It depends on how you want to use that value.
If you want to read that value after your pipeline finishes you could use one of the write transforms (e.g. AvroIO.Write) to write it to some output that you could then read from whatever code executes after your pipeline finishes.
If you want to use that value in a subsequent part of your pipeline then you could apply a View transfrom to generate a PCollectionView which you could then pass as a side input to other transforms.
Consider a simple example where the goal is to print out the Count. The Count won't be available until after the pipeline runs. So in this case we could do the following
Upvotes: 3