Elad Cohen

Reputation: 471

How can I integrate xgboost in spark? (Python)

I am trying to train a model using XGBoost on data I have in Hive. The data is too large and I can't convert it to a pandas DataFrame, so I have to use XGBoost with a Spark DataFrame. When creating an XGBoostEstimator, an error occurs:

TypeError: 'JavaPackage' object is not callable Exception AttributeError: "'NoneType' object has no attribute '_detach'" in ignored

I have no experience with XGBoost for Spark. I have tried a few tutorials online but none worked. I also tried converting to a pandas DataFrame, but the data is too large and I always get an OutOfMemoryException from the Java wrapper (I looked it up, and the suggested fix of raising the executor memory did not work for me).

The latest tutorial I was following is:

https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb

After giving up on the XGBoost module, I started using sparkxgb.

import os

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

spark = create_spark_session('shai', 'dna_pipeline')
# sparkxgboost files
spark.sparkContext.addPyFile('resources/sparkxgb.zip')

def create_spark_session(username=None, app_name="pipeline"):
    if username is not None:
        os.environ['HADOOP_USER_NAME'] = username

    return SparkSession \
        .builder \
        .master("yarn") \
        .appName(app_name) \
        .config(...) \
        .config(...) \
        .getOrCreate()

def train():
    train_df = spark.table('dna.offline_features_train_full')
    test_df = spark.table('dna.offline_features_test_full')

    from sparkxgb import XGBoostEstimator

    vectorAssembler = VectorAssembler() \
        .setInputCols(train_df.columns) \
        .setOutputCol("features")

    # This is where the program fails
    xgboost = XGBoostEstimator(
        featuresCol="features",
        labelCol="label",
        predictionCol="prediction"
    )

    # The assembler has to run before the estimator in the pipeline
    pipeline = Pipeline().setStages([vectorAssembler, xgboost])
    pipeline.fit(train_df)

The full exception is:

Traceback (most recent call last):
  File "/home/elad/DNA/dna/dna/run.py", line 283, in <module>
    main()
  File "/home/elad/DNA/dna/dna/run.py", line 247, in main
    offline_model = train_model(True, home_dir=config['home_dir'], hdfs_client=client)
  File "/home/elad/DNA/dna/dna/run.py", line 222, in train_model
    model = train(offline_mode=offline, spark=spark)
  File "/home/elad/DNA/dna/dna/model/xgboost_train.py", line 285, in train
    predictionCol="prediction"
  File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
  File "/tmp/spark-7781039b-6821-42be-96e0-ca4005107318/userFiles-70b3d1de-a78c-4fac-b252-2f99a6761b32/sparkxgb.zip/sparkxgb/xgboost.py", line 115, in __init__
  File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored

I have no idea why this exception happens, nor do I know how to properly integrate sparkxgb into my code.

Help would be appreciated.

Thanks.

Upvotes: 8

Views: 7158

Answers (3)

Elad Cohen

Reputation: 471

After a day of debugging the hell out of this module, the problem turned out to be just that the jars were being submitted incorrectly. I downloaded the jars locally and passed them to PySpark using:

PYSPARK_SUBMIT_ARGS=--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar

This fixed the problem.
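For completeness, here is a minimal sketch of setting the same arguments from Python instead of the shell. The jar paths are the ones above; the trailing "pyspark-shell" token and the master/app name are just the usual conventions and assumptions about your setup:

import os

# Must be set before the SparkSession (and its JVM) is created; when set
# programmatically the argument string conventionally ends with "pyspark-shell".
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar '
    'pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .appName("dna_pipeline") \
    .getOrCreate()

# With the jars on the classpath, sparkxgb can find the underlying JVM classes
spark.sparkContext.addPyFile('resources/sparkxgb.zip')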

Upvotes: 9

Daniel
Daniel

Reputation: 1242

Instead of using XGBoost you can try LightGBM, which is a similar and arguably better (at least faster) algorithm. It works pretty much out of the box in PySpark; you can read more here.
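For example, a minimal sketch using LightGBMClassifier from SynapseML (formerly MMLSpark); the import path depends on the package version installed on your cluster, and feature_cols, train_df and test_df are assumed to be defined as in the question:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Older MMLSpark releases use `from mmlspark import LightGBMClassifier` instead
from synapse.ml.lightgbm import LightGBMClassifier

# Assemble the feature columns into a single vector column, as with XGBoost
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

lgbm = LightGBMClassifier(
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction"
)

pipeline = Pipeline(stages=[assembler, lgbm])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)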

Upvotes: 3

Richard Rublev
Richard Rublev

Reputation: 8172

Newer Apache Spark (2.3.0) versions do not have XGBoost. You should try it with PySpark; you must convert your Spark DataFrame to a pandas DataFrame.

This is an excellent article that gives the workflow and an explanation of XGBoost and Spark.

OK, I read your post again and you claim that the dataset is too large. Maybe you should try Apache Arrow. Check this: Speeding up PySpark with Apache Arrow
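For example, a minimal sketch of enabling Arrow before converting to pandas; the config key below is the Spark 2.3 name (newer versions use spark.sql.execution.arrow.pyspark.enabled), and the table name is taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow_topandas").getOrCreate()

# Enable Arrow-based columnar transfers for toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.table('dna.offline_features_train_full')

# toPandas() still collects everything to the driver, so Arrow speeds up the
# conversion but does not help if the data truly does not fit in memory.
pandas_df = df.toPandas()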

Upvotes: -1

Related Questions