Giacomo Manna

Reputation: 31

Apache Spark & Machine Learning - Using in production

I'm having some difficulty figuring out how to use Spark's machine learning capabilities in a real-life production environment.

What I want to do is the following:

Let's say the ML training process is handled by a notebook, and once the model requirements are fulfilled the model is saved to HDFS, to be loaded later by a Spark application.
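For context, here is a minimal sketch of that flow in Scala/Spark ML, with a made-up HDFS path, a simple logistic regression pipeline, and assumed DataFrames `trainingDF` and `requestDF`:

    // Training side (e.g. in the notebook): fit a pipeline and persist it to HDFS.
    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(lr))
    val model = pipeline.fit(trainingDF)                      // trainingDF: label/features DataFrame

    model.write.overwrite().save("hdfs:///models/my-model")   // hypothetical model path

    // Serving side (in the Spark application): load the persisted model and score.
    val loaded = PipelineModel.load("hdfs:///models/my-model")
    val predictions = loaded.transform(requestDF)             // requestDF: the data to predict on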

I know I could write a long-running Spark application that exposes the API and run it on my Spark cluster, but I don't think this is really a scalable approach: even if the data transformations and the ML functions run on the worker nodes, the HTTP/API-related code would still run on a single node, the one on which spark-submit is invoked (correct me if I'm wrong).

Another approach is to use the same long-running application, but on a local/standalone cluster. I could deploy the application as many times as I want and put a load balancer in front of it. With this approach the HTTP/API part is handled fine, but the Spark part is not using the cluster's capabilities at all (this might not be a problem, given that it should only perform a single prediction per request).

There is a third approach that uses SparkLauncher, which wraps the Spark job in a separate jar, but I don't really like flying jars, and it is difficult to retrieve the result of the prediction (maybe a queue, or HDFS).
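For reference, a rough sketch of what that could look like from the API side, using SparkLauncher; the jar path, main class, and request id below are hypothetical:

    import org.apache.spark.launcher.SparkLauncher

    val requestId = "req-42"                                // hypothetical correlation id

    val handle = new SparkLauncher()
      .setAppResource("hdfs:///jars/prediction-job.jar")    // hypothetical jar with the prediction logic
      .setMainClass("com.example.PredictionJob")            // hypothetical main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .addAppArgs("hdfs:///models/my-model", requestId)     // model path and request id for the job
      .startApplication()

    // The launched job would then write its prediction somewhere the API can read it
    // back from (e.g. a queue, or an HDFS path keyed by requestId), as noted above.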

So basically the question is: what is the best approach to consume Spark's ML models through a REST API?

Thank You

Upvotes: 3

Views: 1888

Answers (2)

Akarsh Gupta

Reputation: 111

The problem is that you don't want to keep your Spark cluster running and deploy your REST API inside it for prediction, as that is slow.

So, to achieve real-time prediction with low latency, here are a couple of solutions.

What we do is train the model, export the model, and use the model outside Spark to do the prediction.

  1. You can export the model as a PMML file if the ML algorithm you used is supported by the PMML standard. Spark ML models can be exported to a PMML file using the JPMML library. Then you can create your REST API and use the JPMML Evaluator to predict with your Spark ML models (see the sketch after this list).

  2. MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and TensorFlow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring, or into the MLeap runtime to power real-time API services. It supports multiple platforms, though I have only used it for Spark ML models, and it works really well (see the sketch after this list).
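A hedged sketch of option 1, assuming the jpmml-sparkml and jpmml-evaluator dependencies are on the classpath; `trainingDF` and the fitted `pipelineModel` are assumed to come from the training step, and the file path is made up:

    // Export the fitted Spark ML pipeline to PMML (jpmml-sparkml).
    import java.io.File
    import org.jpmml.sparkml.PMMLBuilder

    val pmmlFile = new File("/tmp/model.pmml")
    new PMMLBuilder(trainingDF.schema, pipelineModel).buildFile(pmmlFile)

    // In the REST service (no Spark dependency), load the PMML file with the JPMML Evaluator.
    import org.jpmml.evaluator.LoadingModelEvaluatorBuilder

    val evaluator = new LoadingModelEvaluatorBuilder()
      .load(pmmlFile)
      .build()
    // evaluator.evaluate(...) then maps a request's input fields to a prediction.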
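And a hedged sketch of option 2, following the MLeap documentation; the bundle path is made up, and `model`/`scoredDF` are assumed to exist from training (`scoredDF` being `model.transform(trainingDF)`, which MLeap uses to capture the pipeline's schema):

    // Export the fitted PipelineModel to an MLeap bundle (a zip file).
    import ml.combust.bundle.BundleFile
    import ml.combust.mleap.spark.SparkSupport._
    import org.apache.spark.ml.bundle.SparkBundleContext
    import resource._

    val sbc = SparkBundleContext().withDataset(scoredDF)
    for (bundle <- managed(BundleFile("jar:file:/tmp/model.zip"))) {
      model.writeBundle.save(bundle)(sbc).get
    }

    // In the REST service, load the bundle with the Spark-free MLeap runtime.
    import ml.combust.mleap.runtime.MleapSupport._

    val mleapPipeline = (for (bundle <- managed(BundleFile("jar:file:/tmp/model.zip")))
      yield bundle.loadMleapBundle().get.root).opt.get
    // mleapPipeline.transform(...) then scores an MLeap DefaultLeapFrame per request.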

Upvotes: 1

elcomendante

Reputation: 1161

You have three options:

  1. Trigger a batch ML job via the spark-jobserver REST API, upon client request.
  2. Trigger a batch ML job via a scheduler such as Airflow, write the output to a DB, and expose the DB via REST to the client (see the batch job sketch after this list).
  3. Keep a structured-streaming / recursive function running to scan the input data source, continuously update/append the DB, and expose the DB via REST to the client.
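A rough sketch of the batch scoring job behind options 1 and 2; the model path, input location, and DB connection details below are all made up:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SparkSession

    object BatchScoringJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("batch-scorer").getOrCreate()
        val model = PipelineModel.load("hdfs:///models/my-model")     // hypothetical model path

        val input = spark.read.parquet("hdfs:///input/daily")         // assumed input location
        val predictions = model.transform(input)

        predictions.write
          .format("jdbc")                                             // any DB the REST API can query
          .option("url", "jdbc:postgresql://db:5432/predictions")     // hypothetical connection
          .option("dbtable", "daily_predictions")
          .option("user", "spark")
          .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
          .mode("overwrite")
          .save()

        spark.stop()
      }
    }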

If you have a single prediction per request and your data input is constantly updated, I would suggest option 3, which transforms the data in near real time at all times, so the client has constant access to the output. You can notify the client when new data is ready by sending a notification via REST or SNS. You could keep a pretty small Spark cluster that handles the data ingest, and scale the REST service and DB according to request/data volume (load balancer).
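A rough sketch of what option 3 could look like with structured streaming, assuming the model saved by the training step, a file-based landing directory as the input source, and a parquet sink standing in for the DB; all paths are made up:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("streaming-scorer").getOrCreate()
    val model = PipelineModel.load("hdfs:///models/my-model")       // model saved by the training step

    // Streaming file sources need an explicit schema; infer it once from existing data.
    val inputSchema = spark.read.parquet("hdfs:///incoming").schema

    val input = spark.readStream
      .schema(inputSchema)
      .parquet("hdfs:///incoming")                                  // assumed landing dir for new data

    val scored = model.transform(input)

    scored.writeStream
      .format("parquet")                                            // or foreachBatch to write to a DB
      .option("path", "hdfs:///predictions")                        // the REST layer serves from here / the DB
      .option("checkpointLocation", "hdfs:///checkpoints/scorer")
      .start()
      .awaitTermination()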

If you anticipate rare requests and a data source that is updated periodically, let's say once a day, option 1 or 2 will be suitable, as you can launch a bigger cluster and shut it down when the job is completed.

Hope it helps.

Upvotes: 2
