kipliko

Reputation: 47

How to re-train models on new batches only (without taking the previous training dataset) in Spark Streaming?

I'm trying to write my first recommendation model (Spark 2.0.2), and I would like to know whether it is possible, after the initial training in which the model processes my entire RDD, to work with just a delta for future training runs.

Let me explain through an example:

  1. When the system starts, the first batch performs the initial training session with the whole RDD (200,000 elements).
  2. At the end of training, the model is saved.
  3. A second application (Spark Streaming) loads the previously saved model and listens on a Kinesis queue.
  4. When a new element arrives, the second application should perform training (in delta mode?!) without loading all 200,000 previous elements, using just the saved model and the new element.
  5. At the end of training, the updated model is saved.

The question is: is it possible to execute step 4 in some way?
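For reference, here is a minimal sketch of steps 1–3 as I imagine them (the paths and the RDD-based ALS API are just assumptions on my side, since I have not settled on an algorithm yet):

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    // Step 1: initial training on the full RDD (~200,000 elements)
    val ratings = sc.textFile("/data/ratings.csv").map { line =>   // hypothetical input path
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }
    val model = ALS.train(ratings, rank = 10, iterations = 10)

    // Step 2: persist the trained model
    model.save(sc, "/models/recommendations")                      // hypothetical output path

    // Step 3: in the streaming application, reload the saved model
    val loaded = MatrixFactorizationModel.load(sc, "/models/recommendations")

    // Step 4 is the open question: update `loaded` with a single new element,
    // without re-reading the original 200,000 ratings.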

Upvotes: 1

Views: 623

Answers (1)

Jacek Laskowski

Reputation: 74679

My understanding is that this is only possible with machine learning algorithms that are designed to support streaming training, such as StreamingKMeans or StreamingLogisticRegressionWithSGD.

Quoting their documentation:

(StreamingLogisticRegressionWithSGD) trains or predicts a logistic regression model on streaming data. Training uses Stochastic Gradient Descent to update the model based on each new batch of incoming data from a DStream (see LogisticRegressionWithSGD for model equation)

StreamingKMeans provides methods for configuring a streaming k-means analysis, training the model on streaming data, and using the model to make predictions on streaming data.
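To illustrate, here is a minimal StreamingKMeans sketch (the input source and all parameters are assumptions): with trainOn, every new micro-batch updates the existing cluster centers, so earlier batches are never reloaded.

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))

    // Hypothetical stream of feature vectors, one per line, e.g. "[1.0,2.0]"
    val trainingData = ssc.textFileStream("/streaming/train").map(Vectors.parse)

    val model = new StreamingKMeans()
      .setK(3)                                 // number of clusters
      .setDecayFactor(1.0)                     // 1.0 = weight all past data equally
      .setRandomCenters(dim = 2, weight = 0.0) // random initial centers

    // Each incoming batch updates the current centers in place.
    model.trainOn(trainingData)

    ssc.start()
    ssc.awaitTermination()

The same pattern applies to StreamingLogisticRegressionWithSGD, whose trainOn runs SGD updates against the current weights for each new batch.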

What worries me about these algorithms is that they live in the RDD-based org.apache.spark.mllib packages (org.apache.spark.mllib.clustering and org.apache.spark.mllib.classification), which are in maintenance mode now that the DataFrame-based API is the primary one. I don't know whether there are JIRAs to retrofit them with DataFrame support.

Upvotes: 1
