Reputation: 647
I have fit a LinearRegression model in Spark 2.10 - after using StringIndexer and OneHotEncoder I have a ~44 element features vector. For a new bit of data I'd like to get a prediction on, how can I create a features vector from the new data element?
First, this is completely contrived example to learn how to do this. Using logs with the fields:
"elapsed_time", "api_name", "method", and "status_code"
We will create a model of label elapsed_time
and use the other fields as our feature set. The complete code will be shared below.
Results were horrible, but like I said this is a contrived exercise...
If a new log entry came in to a streaming application for example, how would I go about creating a feature vector from the new data and pass it in to predict()?
A new log entry might be:
{api_name":"/sample_api_1/v2","method":"GET","status_code":"200","elapsed_time":39}
status_code_vector
(14,[0],[1.0])
api_name_vector
(27,[0],[1.0])
method_vector
(3,[0],[1.0])
features vector
(44,[0,14,41],[1.0,1.0,1.0])
%spark
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, StringIndexerModel, VectorSlicer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
val logs = sc.textFile("/Users/z001vmk/data/sample_102M.txt")
val dfLogsRaw: DataFrame = spark.read.json(logs)
val dfLogsFiltered = dfLogsRaw.filter("status_code != 314").drop("extra_column")
// Create DF with our fields of concern.
val dfFeatures: DataFrame = dfLogsFiltered.select("elapsed_time", "api_name", "method", "status_code")
// Contrived goal:
// Use elapsed time as our label given features api_name, status_code, & method.
// Train model on small (100Mb) dataset
// Be able to predict elapsed_time given a new record similar to this example:
// --> {api_name":"/sample_api_1/v2","method":"GET","status_code":"200","elapsed_time":39}
// Indexers
val statusCodeIdxr: StringIndexer = new StringIndexer().setInputCol("status_code").setOutputCol("status_code_idx").setHandleInvalid("skip")
val apiNameIdxr: StringIndexer = new StringIndexer().setInputCol("api_name").setOutputCol("api_name_idx").setHandleInvalid("skip")
val methodIdxr: StringIndexer = new StringIndexer().setInputCol("method").setOutputCol("method_idx").setHandleInvalid("skip")
// Index features:
val dfIndexed0: DataFrame = statusCodeIdxr.fit(dfFeatures).transform(dfFeatures)
val dfIndexed1: DataFrame = apiNameIdxr.fit(dfIndexed0).transform(dfIndexed0)
val indexed: DataFrame = methodIdxr.fit(dfIndexed1).transform(dfIndexed1)
// OneHotEncoders
val statusCodeEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(statusCodeIdxr.getOutputCol).setOutputCol("status_code_vec")
val apiNameEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(apiNameIdxr.getOutputCol).setOutputCol("api_name_vec")
val methodEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(methodIdxr.getOutputCol).setOutputCol("method_vec")
// Encode feature vectors
val encoded0: DataFrame = statusCodeEncoder.transform(indexed)
val encoded1: DataFrame = apiNameEncoder.transform(encoded0)
val encoded: DataFrame = methodEncoder.transform(encoded1)
// Limit our dataset to necessary elements:
val dataset0 = encoded.select("elapsed_time", "status_code_vec", "api_name_vec", "method_vec").withColumnRenamed("elapsed_time", "label")
// Assemble feature vectors
val assembler: VectorAssembler = new VectorAssembler().setInputCols(Array("status_code_vec", "api_name_vec", "method_vec")).setOutputCol("features")
val dataset1 = assembler.transform(dataset0)
dataset1.show(5,false)
// Prepare the dataset for training (optional):
val dataset: DataFrame = dataset1.select("label", "features")
dataset.show(3,false)
val Array(training, test) = dataset.randomSplit(Array(0.8, 0.2))
// Create our Linear Regression Model
val lr: LinearRegression = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("label").setFeaturesCol("features")
val lrModel = lr.fit(training)
val predictions = lrModel.transform(test)
predictions.show(20,false)
This can all be pasted into a Zeppelin notebook if you're interested.
So, what I've been scouring about for is how to transform new data into a ~35ish element feature vector and and use the model fit to the training data to transform it and get a prediction. I suspect there is metadata either held in the model itself or that would need to be maintained from the StringIndexers in this case - but that's what I cannot find.
Very happy to be pointed to docs or examples - all help appreciated.
Thank you!
Upvotes: 1
Views: 1877
Reputation: 647
After going down the road of using a PipelineModel, this became quite simple. Hat tip to @tadamhicks for getting me to look at piplines sooner than later.
Below is an updated code block that performs basically the same model creation, fit, and prediction as above but does so using pipelines and has an added bit where we predict on a newly created DataFrame to simulate how to predict on new data.
There is likely a cleaner way to rename/create our label column, but we'll leave that as a future enhancement.
%spark
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler, StringIndexerModel, VectorSlicer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
val logs = sc.textFile("/data/sample_102M.txt")
val dfLogsRaw: DataFrame = spark.read.json(logs)
val dfLogsFiltered = dfLogsRaw.filter("status_code != 314").drop("extra_column")
.select("elapsed_time", "api_name", "method", "status_code","cache_status")
.withColumnRenamed("elapsed_time", "label")
val Array(training, test) = dfLogsFiltered.randomSplit(Array(0.8, 0.2))
// Indexers
val statusCodeIdxr: StringIndexer = new StringIndexer().setInputCol("status_code").setOutputCol("status_code_idx").setHandleInvalid("skip")
val apiNameIdxr: StringIndexer = new StringIndexer().setInputCol("api_name").setOutputCol("api_name_idx").setHandleInvalid("skip")
val methodIdxr: StringIndexer = new StringIndexer().setInputCol("method").setOutputCol("method_idx").setHandleInvalid("skip")//"cache_status"
val cacheStatusIdxr: StringIndexer = new StringIndexer().setInputCol("cache_status").setOutputCol("cache_status_idx").setHandleInvalid("skip")
// OneHotEncoders
val statusCodeEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(statusCodeIdxr.getOutputCol).setOutputCol("status_code_vec")
val apiNameEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(apiNameIdxr.getOutputCol).setOutputCol("api_name_vec")
val methodEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(methodIdxr.getOutputCol).setOutputCol("method_vec")
val cacheStatusEncoder: OneHotEncoder = new OneHotEncoder().setInputCol(cacheStatusIdxr.getOutputCol).setOutputCol("cache_status_vec")
// Vector Assembler
val assembler: VectorAssembler = new VectorAssembler().setInputCols(Array("status_code_vec", "api_name_vec", "method_vec", "cache_status_vec")).setOutputCol("features")
val lr: LinearRegression = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setLabelCol("label").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(statusCodeIdxr, apiNameIdxr, methodIdxr, cacheStatusIdxr, statusCodeEncoder, apiNameEncoder, methodEncoder, cacheStatusEncoder, assembler, lr))
val plModel: PipelineModel = pipeline.fit(training)
plModel.write.overwrite().save("/tmp/spark-linear-regression-model")
plModel.transform(test).select("label", "prediction").show(5,false)
val dataElement: String = """{"api_name":"/sample_api/v2","method":"GET","status_code":"200","cache_status":"MISS","elapsed_time":39}"""
val newDataRDD = spark.sparkContext.makeRDD(dataElement :: Nil)
val newData = spark.read.json(newDataRDD).withColumnRenamed("elapsed_time", "label")
val loadedPlModel = PipelineModel.load("/tmp/spark-linear-regression-model")
loadedPlModel.transform(newData).select("label", "prediction").show
Upvotes: 1
Reputation: 925
Short answer: Pipeline models.
Just to make sure you understand, though, you don't want to create your model when you start an app, if you don't have to. Unless you're going to use DataSets and feedback, it's just silly. Create your model in a Spark Submit session (or use a notebook session like Zeppelin) and save it down. That's doing your data science.
Most DS guys hand the model over, and let the DevOps/Data Engineers use it. All they have to do is call a .predict() on the object after it's been loaded into memory.
Upvotes: 1