Reputation: 357
I've recently been working with an MLlib Databricks cluster and saw that, according to the docs, XGBoost is available for my cluster version (5.1). The cluster is running Python 2.
As far as I can tell, XGBoost4J is only available for Scala and Java. So my question is: how do I import the xgboost module into this environment without losing the distribution capabilities?
A sample of my code is below:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
import xgboost as xgb  # Raises ImportError: the module is not installed, although the docs say it should be
# Transform class to classIndex to make xgboost happy
stringIndexer = StringIndexer(inputCol="species", outputCol="species_index").fit(newInput)
labelTransformed = stringIndexer.transform(newInput).drop("species")
# Compose feature columns as vectors (the label, species_index, is kept out of the features)
vectorCols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
vectorAssembler = VectorAssembler(inputCols=vectorCols, outputCol="features")
xgbInput = vectorAssembler.transform(labelTransformed).select("features", "species_index")
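For reference, this is roughly how I check whether a module can be imported at all. It is only a minimal sketch: module_available is a helper name I made up, and it only tells you about the Python environment of the driver, not of the workers.

```python
def module_available(name):
    """Return True if `name` can be imported in the current Python environment."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# Only checks the interpreter this runs in (the driver when run from a notebook cell).
print("xgboost importable here: %s" % module_available("xgboost"))
```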
Upvotes: 0
Views: 1957
Reputation: 26
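You can try spark-sklearn to distribute the Python (scikit-learn API) version of xgboost, but that kind of distribution is different from what xgboost4j does: spark-sklearn spreads independent model fits, such as the candidates of a grid search, across the cluster, whereas xgboost4j distributes the training of a single model. I've heard that a PySpark API for xgboost4j on Databricks is coming, so stay tuned.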
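To illustrate the kind of parallelism spark-sklearn gives you (many independent fits, one per task), here is a rough sketch that uses only the standard library, not spark-sklearn or xgboost themselves. toy_score is a hypothetical stand-in for "train one model with these parameters and return its validation score", and parallel_grid_search is a name I made up; threads here play the role that Spark tasks would play on a cluster.

```python
from itertools import product
from multiprocessing.pool import ThreadPool


def toy_score(params):
    """Hypothetical stand-in for fitting one model and returning its score."""
    max_depth, learning_rate = params
    # Purely illustrative objective that peaks at max_depth=4, learning_rate=0.1.
    return -((max_depth - 4) ** 2) - (10 * (learning_rate - 0.1)) ** 2


def parallel_grid_search(score_fn, max_depths, learning_rates, workers=4):
    """Score every parameter combination independently, one combination per task."""
    candidates = list(product(max_depths, learning_rates))
    pool = ThreadPool(workers)
    try:
        # Each candidate is evaluated on its own; no single fit is split across workers.
        scores = pool.map(score_fn, candidates)
    finally:
        pool.close()
        pool.join()
    best_score, best_params = max(zip(scores, candidates))
    return best_params, best_score


best_params, best_score = parallel_grid_search(toy_score, [2, 4, 6], [0.01, 0.1, 0.3])
print("best params: %s" % (best_params,))
```

Each individual fit still has to fit on one machine in this style, which is the main thing you give up compared to xgboost4j.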
Upvotes: 1