Dmitry Ermolov
Dmitry Ermolov

Reputation: 2237

Apache Beam: maintaining state in distributed KV table

I'm trying to better understand Beam computation model and to check if my problem is solvable within this model.

Suppose I have a stream of events,

class Event {
    public int userId;
    public int score;
}

I want to build pipeline that:

I've read about stateful processing and as far as I understand it's easy to maintain maximum score for user inside StatefulParDo. But how such state is stored is Beam implementation detail and this state is not available outside StatefulParDo function.

Is it possible to keep such state in well defined format in some sort of KV storage available for external consumers (readers outside of my pipeline)?

Upvotes: 0

Views: 182

Answers (1)

chamikara
chamikara

Reputation: 2024

So you have to pick either Beam State API or an external storage system.

Exactly where Beam State is stored is up to the runner. You cannot directly access such state outside the State API.

If you decide to use the external storage path, you could write to such a storage system from a Beam ParDo. But you'll have to handle performance when reading/writing and consistency of such data. Also you have to assume that any Beam step might fail and may be re-run by the runner (hence should prevent duplicate writes).

Upvotes: 1

Related Questions