Reputation: 1615
I am using mapGroupsWithState
on a streaming dataset to maintain state across batches. Where is this data/state stored? Executors, the driver or somewhere else?
Upvotes: 5
Views: 212
Reputation: 74669
but I am not sure where this data/state is stored? (Executor or driver)
State is persisted to [checkpointLocation]/state
that should be on a reliable HDFS-compliant distributed file system so executors (and tasks) can access it when required.
That gives [checkpointLocation]/state
.
There could be many stateful operators, each with its own operatorId
that is used to store operator-specific state. That's why you may have zero, one or more state subdirectories for every stateful operator.
That gives [checkpointLocation]/state/[operatorId]
.
There are even more subdirectories in stateful operator-specific state directory for partitions.
That gives the following state-specific directory layout:
[checkpointLocation]/state/[operatorId]/[partitionId]
Use web UI to find out the checkpointLocation
, operatorId
and the number of partitions.
The state of a stateful operator is recreated from [checkpointLocation]/state
using StateStoreRestoreExec
unary physical operator (use explain
to find it). StateStoreRestoreExec
restores (reads) a streaming state from a state store for the keys that are given by the child physical operator. My understanding is that the state is recreated every micro-batch.
Upvotes: 3