Spark Streaming - Can an offline model be used against a data stream

Question

In this link - LINK, it is mentioned that a machine learning model which has been constructed offline can be used against streaming data for testing.

Excerpt from the Apache Spark Streaming MLlib link:

" You can also easily use machine learning algorithms provided by MLlib. First of all, there are streaming machine learning algorithms (e.g. Streaming Linear Regression, Streaming KMeans, etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the MLlib guide for more details. "

Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program? Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?
In this case, would the main difference between the existing spark streaming machine learning algorithms (AND) this approach be the fact that the streaming algorithms will evolve over time and the offline(against)online stream approach would still be using the insights from what it had learnt earlier without any possibility of online learning?

Am I getting this right? Please let me know if my understanding for both the points mentioned above is correct.

Amit Kumar · Accepted Answer

Does this mean that one can use a complex learning model like Random Forest model built in Spark for testing against streaming data in Spark Streaming program?

Yes, you can train a model like Random Forest in batch mode and store the model for predictions later. In case you want to integrate this with a streaming application where values are coming continuously for prediction you just need to load the model(which actually reads the feature vector and its weight) in memory and do prediction till the end.

Is it as simple as referring to the "Model" which has been built and calling "predictOnValues()" over it in Spark Streaming program?

Yes.

In this case, would the main difference between the existing spark streaming machine learning algorithms (AND) this approach be the fact that the streaming algorithms will evolve over time and the offline(against)online stream approach would still be using the insights from what it had learnt earlier without any possibility of online learning?

Training a model does nothing more than updating weight vector for features. You still have to choose alpha(learning rate) and lambda(regularisation parameter). So, when you will be using StreamingLinearRegression (or other streaming equivalents) you will have two dStreams one for training and other for prediction for obvious purposes.

Spark Streaming - Can an offline model be used against a data stream

Answers (1)

Related Questions