Reputation: 33489
I have code which essentially looks like this:
class FoodTrainer(images: S3Path) { // data is a >100GB file living in S3
  def train(): FoodClassifier // Very expensive - takes ~5 hours!
}

class FoodClassifier { // Light-weight API class
  def isHotDog(input: Image): Boolean
}
I want to invoke val classifier = new FoodTrainer(s3Dir).train() at JAR-assembly (sbt assembly) time, and publish a JAR in which the classifier instance is instantly available to downstream library users.
What is the easiest way to do this? What are some established paradigms for it? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar
How do I do this using sbt assembly so that I do not have to check a large model class or data file into version control?
Upvotes: 7
Views: 1143
Reputation: 33489
Okay, I managed to do this:
Separate the project into two SBT sub-modules: food-trainer and food-model. The former is only invoked at compile time to create the model and serialize it into the generated resources of the latter. The latter serves as a simple factory object that instantiates a model from the serialized version. Every downstream project depends only on this food-model submodule.
The food-trainer module has the bulk of the code and a main method that serializes the FoodModel:
object FoodTrainer {
  def main(args: Array[String]): Unit = {
    val input = args(0)
    val outputDir = args(1)
    val model: FoodModel = new FoodTrainer(input).train()
    val out = new ObjectOutputStream(new FileOutputStream(outputDir + "/model.bin"))
    try out.writeObject(model) finally out.close()
  }
}
Add a compile-time task in your build.sbt that runs the food-trainer module to generate the model resource:
lazy val foodTrainer = (project in file("food-trainer"))

lazy val foodModel = (project in file("food-model"))
  .dependsOn(foodTrainer)
  .settings(
    resourceGenerators in Compile += Def.task {
      val log = streams.value.log
      val dest = (resourceManaged in Compile).value
      IO.createDirectory(dest)
      runModuleMain(
        cmd = s"com.foo.bar.FoodTrainer $pathToImages ${dest.getAbsolutePath}",
        cp = (fullClasspath in Runtime in foodTrainer).value.files,
        log = log
      )
      Seq(dest / "model.bin")
    }.taskValue
  )

def runModuleMain(cmd: String, cp: Seq[File], log: Logger): Unit = {
  log.info(s"Running $cmd")
  val opt = ForkOptions(bootJars = cp, outputStrategy = Some(LoggedOutput(log)))
  val res = Fork.scala(config = opt, arguments = cmd.split(' '))
  require(res == 0, s"$cmd exited with code $res")
}
Now in your food-model module, you have something like this:
object FoodModel {
  lazy val model: FoodModel =
    new ObjectInputStream(getClass.getResourceAsStream("/model.bin"))
      .readObject()
      .asInstanceOf[FoodModel]
}
Every downstream project now depends only on food-model and simply uses FoodModel.model. We get the benefit of not having to publish FoodTrainer and FoodModel into their own separately versioned JARs (and the headache of deploying them internally); instead we keep them as sub-modules of the same project, which gets packed into a single JAR.
Upvotes: 0
Reputation: 2401
The steps are as follows.
During the resource generation phase of the build, train the model and write it into the managed resources:

resourceGenerators in Compile += Def.task {
  val classifier = new FoodTrainer(s3Dir).train()
  val contents = FoodClassifier.serialize(classifier)
  val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
  IO.write(file, contents)
  Seq(file)
}.taskValue

The model file will get packaged into the jar file automatically, and it won't appear in the source tree. Then load it lazily from the classpath:

object FoodClassifierModel {
  lazy val classifier = readResource("/mypackage/food-classifier.model")

  def readResource(resourceName: String): FoodClassifier = {
    val stream = getClass.getResourceAsStream(resourceName)
    val lines = scala.io.Source.fromInputStream(stream).getLines
    val contents = lines.mkString("\n")
    FoodClassifier.parse(contents)
  }
}

object FoodClassifier {
  def parse(content: String): FoodClassifier
  def serialize(classifier: FoodClassifier): String
}
Of course, since your data is rather big, you'll need to use streaming serializers and parsers so as not to overload the Java heap. The above just shows how to package a resource at build time.
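To illustrate the streaming point, here is a minimal sketch, assuming a hypothetical model that is just an array of weights (the Model and StreamingModelIO names are invented for the example). Writing and reading primitives one at a time through Data(Output|Input)Stream keeps heap usage constant, instead of materialising the whole serialized form as one giant String:

```scala
import java.io._

// Hypothetical stand-in for the trained model.
case class Model(weights: Array[Double])

object StreamingModelIO {
  def write(m: Model, out: OutputStream): Unit = {
    val d = new DataOutputStream(new BufferedOutputStream(out))
    try {
      d.writeInt(m.weights.length)     // header: element count
      m.weights.foreach(d.writeDouble) // stream one primitive at a time
    } finally d.close()
  }

  def read(in: InputStream): Model = {
    val d = new DataInputStream(new BufferedInputStream(in))
    try Model(Array.fill(d.readInt())(d.readDouble()))
    finally d.close()
  }
}
```

The same pair of methods works against FileOutputStream/FileInputStream for the packaged resource; a ByteArray stream round-trip is enough to sanity-check the format.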
See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html
Upvotes: 4
Reputation: 83577
You should serialize the data which results from training into its own file, then package that data file in your JAR. Your production code opens the file and reads it rather than running the training algorithm.
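A minimal sketch of that split, using plain Java serialization (TrainedData and ModelStore are hypothetical names standing in for whatever the training step actually produces):

```scala
import java.io._

// Stand-in for the expensive training result; Scala case classes
// are Serializable, so ObjectOutputStream can write them directly.
case class TrainedData(threshold: Double, labels: Vector[String])

object ModelStore {
  // Run once, at build/training time: persist the expensive result.
  def save(data: TrainedData, file: File): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try out.writeObject(data) finally out.close()
  }

  // Run in production: load the file instead of retraining.
  def load(file: File): TrainedData = {
    val in = new ObjectInputStream(new FileInputStream(file))
    try in.readObject().asInstanceOf[TrainedData] finally in.close()
  }
}
```

The saved file can then be dropped under src/main/resources (or generated into resourceManaged, as in the other answers) so it travels inside the JAR.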
Upvotes: 4
Reputation: 10508
Here's an idea: put your model in a resource folder that gets added into the JAR assembly. I think all JARs get distributed with your model if it's in that folder. Let me know how it goes, cheers!
Check this out for reading from resources:
https://www.mkyong.com/java/java-read-a-file-from-resources-folder/
It's in Java, but you can still use the API from Scala.
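For reference, a small Scala sketch of the same idea (the ResourceText name is invented). In a real project you would pass it getClass.getResourceAsStream("/mypackage/model.txt"), which resolves files placed under src/main/resources and hence bundled into the assembled JAR; here the stream source is left generic:

```scala
import java.io.InputStream

object ResourceText {
  // Read an entire text stream, e.g. one obtained from
  // getClass.getResourceAsStream("/path/inside/jar.txt").
  def read(stream: InputStream): String = {
    require(stream != null, "resource not found on classpath")
    try scala.io.Source.fromInputStream(stream).mkString
    finally stream.close()
  }
}
```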
Upvotes: -1