user6134689

Can I store data in RAM with Apache Spark?

I would like to know if it is possible to store a bunch of strings, for example, in RAM with Apache Spark. Specifically, I want to query and update these strings depending on the new input data that Apache Spark is processing. Furthermore, if that is possible, can a node notify all other nodes of which strings are stored? If you need more information about my project, feel free to ask.


Upvotes: 0

Views: 332

Answers (1)

ImDarrenG

Reputation: 2345

Yes. You can use Spark Streaming's stateful operation mapWithState, which lets you query and update per-key state that is cached in memory across streaming batches.

Note that you will need to enable checkpointing if you haven't already done so.
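For example, checkpointing is enabled by calling checkpoint on the StreamingContext before starting the stream. A minimal sketch (the app name, batch interval, and checkpoint directory are placeholders, not values from your setup):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulApp")
val ssc = new StreamingContext(conf, Seconds(1))

// mapWithState requires a checkpoint directory; use a fault-tolerant
// filesystem (e.g. HDFS) in production, a local path for testing
ssc.checkpoint("hdfs:///tmp/spark-checkpoints")
```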

Scala example usage:

def stateUpdateFunction(userId: UserId,
                        newData: UserAction,
                        stateData: State[UserSession]): UserModel = {
  val currentSession = stateData.get()  // Get current session data
  val updatedSession = ...              // Compute updated session using newData
  stateData.update(updatedSession)      // Update session data
  val userModel = ...                   // Compute model using updatedSession
  userModel                             // Send model downstream
}

// Stream of user actions, keyed by the user ID
val userActions = ...  // stream of key-value tuples of (UserId, UserAction)
// Stream of data to commit
val userModels = userActions.mapWithState(StateSpec.function(stateUpdateFunction))

Scala example taken from: https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html

Java example usage:

// Update the cumulative count function
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> mappingFunc =
    new Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>>() {
      @Override
      public Tuple2<String, Integer> call(String word, Optional<Integer> one,
          State<Integer> state) {
        int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
        Tuple2<String, Integer> output = new Tuple2<>(word, sum);
        state.update(sum);
        return output;
      }
    };

// DStream of cumulative counts that get updated in every batch
JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
    wordsDstream.mapWithState(StateSpec.function(mappingFunc).initialState(initialRDD));

Java example taken from https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaStatefulNetworkWordCount.java (see line 90).

Upvotes: 1
