conetfun
conetfun

Reputation: 1615

Where is the state saved for arbitrary state processing using mapGroupsWithState?

I am using mapGroupsWithState on a streaming dataset to maintain state across batches. Where is this data/state stored? Executors, the driver or somewhere else?

Upvotes: 5

Views: 212

Answers (1)

Jacek Laskowski
Jacek Laskowski

Reputation: 74669

but I am not sure where this data/state is stored? (Executor or driver)

State is persisted to [checkpointLocation]/state that should be on a reliable HDFS-compliant distributed file system so executors (and tasks) can access it when required.

That gives [checkpointLocation]/state.

There could be many stateful operators, each with its own operatorId that is used to store operator-specific state. That's why you may have zero, one or more state subdirectories for every stateful operator.

That gives [checkpointLocation]/state/[operatorId].

There are even more subdirectories in stateful operator-specific state directory for partitions.

That gives the following state-specific directory layout:

[checkpointLocation]/state/[operatorId]/[partitionId]

Use web UI to find out the checkpointLocation, operatorId and the number of partitions.

The state of a stateful operator is recreated from [checkpointLocation]/state using StateStoreRestoreExec unary physical operator (use explain to find it). StateStoreRestoreExec restores (reads) a streaming state from a state store for the keys that are given by the child physical operator. My understanding is that the state is recreated every micro-batch.

Upvotes: 3

Related Questions