XGBoost in Databricks with Python

Question

I've recently been working with an MLlib Databricks cluster and saw that, according to the docs, XGBoost is available for my cluster version (5.1). This cluster is running Python 2.

I get the feeling that XGBoost4J is only available for Scala and Java. So my question is: how do I import the xgboost module into this environment without losing the distribution capabilities?

A sample of my code is below

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
import xgboost as xgb  # Fails: the module is not installed, although it should be

# Transform class to classIndex to make xgboost happy
stringIndexer = StringIndexer(inputCol="species", outputCol="species_index").fit(newInput)
labelTransformed = stringIndexer.transform(newInput).drop("species")

# Compose feature columns as vectors (the label, species_index, is kept out of the features)
vectorCols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
vectorAssembler = VectorAssembler(inputCols=vectorCols, outputCol="features")
xgbInput = vectorAssembler.transform(labelTransformed).select("features", "species_index")

pebblecoin · Accepted Answer

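You can try using spark-sklearn to distribute the Python (scikit-learn style) version of xgboost, but that kind of distribution is different from what xgboost4j does: spark-sklearn spreads a hyperparameter search over the cluster, with each model still trained on a single node, whereas xgboost4j distributes the training of one model. I've heard that a PySpark API for xgboost4j on Databricks is coming, so stay tuned.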
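
To make the spark-sklearn suggestion concrete, here is a minimal sketch of what it might look like. It assumes the xgboost and spark-sklearn packages are installed on the cluster; the helper names (candidate_settings, distributed_grid_search) and the parameter grid are hypothetical and only for illustration.

```python
from itertools import product


def candidate_settings(param_grid):
    """Expand a parameter grid into the individual settings to be tried."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


def distributed_grid_search(sc, X, y, param_grid):
    """Run a grid search whose individual model fits execute as Spark tasks.

    Assumption: xgboost and spark-sklearn are installed on the cluster.
    """
    # Imported lazily so the helper above works without either package.
    import xgboost as xgb
    from spark_sklearn import GridSearchCV

    # Same interface as scikit-learn's GridSearchCV, plus the SparkContext.
    search = GridSearchCV(sc, xgb.XGBClassifier(), param_grid)
    search.fit(X, y)  # X and y are in-memory arrays held on the driver
    return search


# Hypothetical grid, chosen only for illustration.
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}
settings = candidate_settings(param_grid)
print(len(settings))  # number of settings the search would evaluate
```

In a Databricks notebook, sc is already defined, and X and y could come from something like df.toPandas(). Every model is fitted on a single worker, so the whole training set has to fit in one machine's memory.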
