Reputation: 471
I am trying to train a model using XGBoost on data I have in Hive. The data is too large and I can't convert it to a pandas df, so I have to use XGBoost with a Spark df. When creating an XGBoostEstimator, an error occurs:
TypeError: 'JavaPackage' object is not callable Exception AttributeError: "'NoneType' object has no attribute '_detach'" in ignored
I have no experience with XGBoost for Spark; I have tried a few tutorials online but none worked.
I tried to convert to a pandas df but the data is too large and I always get an OutOfMemoryException
from the Java wrapper (I also looked that up, and the suggested solution, raising the executor memory, did not work for me).
The latest tutorial I was following is:
After giving up on the XGBoost module, I started using sparkxgb.
spark = create_spark_session('shai', 'dna_pipeline')
# sparkxgboost files
spark.sparkContext.addPyFile('resources/sparkxgb.zip')


def create_spark_session(username=None, app_name="pipeline"):
    if username is not None:
        os.environ['HADOOP_USER_NAME'] = username
    return SparkSession \
        .builder \
        .master("yarn") \
        .appName(app_name) \
        .config(...) \
        .config(...) \
        .getOrCreate()
def train():
    train_df = spark.table('dna.offline_features_train_full')
    test_df = spark.table('dna.offline_features_test_full')

    from sparkxgb import XGBoostEstimator

    vectorAssembler = VectorAssembler() \
        .setInputCols(train_df.columns) \
        .setOutputCol("features")

    # This is where the program fails
    xgboost = XGBoostEstimator(
        featuresCol="features",
        labelCol="label",
        predictionCol="prediction"
    )

    pipeline = Pipeline().setStages([xgboost])
    pipeline.fit(train_df)
The full exception is:
Traceback (most recent call last):
File "/home/elad/DNA/dna/dna/run.py", line 283, in <module>
main()
File "/home/elad/DNA/dna/dna/run.py", line 247, in main
offline_model = train_model(True, home_dir=config['home_dir'], hdfs_client=client)
File "/home/elad/DNA/dna/dna/run.py", line 222, in train_model
model = train(offline_mode=offline, spark=spark)
File "/home/elad/DNA/dna/dna/model/xgboost_train.py", line 285, in train
predictionCol="prediction"
File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/__init__.py", line 105, in wrapper
return func(self, **kwargs)
File "/tmp/spark-7781039b-6821-42be-96e0-ca4005107318/userFiles-70b3d1de-a78c-4fac-b252-2f99a6761b32/sparkxgb.zip/sparkxgb/xgboost.py", line 115, in __init__
File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored
I have no idea why this exception happens, nor do I know how to properly integrate sparkxgb into my code.
Help would be appreciated.
Thanks
Upvotes: 8
Views: 7158
Reputation: 471
After a day of debugging the hell out of this module, the problem was just that I was submitting the jars incorrectly. I downloaded the jars locally and submitted them to PySpark using:
PYSPARK_SUBMIT_ARGS=--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar
This fixed the problem.
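For reference, a fuller invocation might look like the following. This is a sketch, not the exact command from the answer: the `pyspark-shell` suffix is what PySpark generally expects at the end of `PYSPARK_SUBMIT_ARGS` when launching from a plain Python process, and the script name `run.py` is taken from the traceback above; adjust jar paths and versions to match your setup.

```shell
# Make the xgboost4j jars visible to the JVM that PySpark starts.
# Both the core jar and the -spark jar are needed, and their versions must match.
export PYSPARK_SUBMIT_ARGS="--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar pyspark-shell"

# Run the training script in the same shell so the variable is inherited.
python run.py
```

If the jars are missing or mismatched, the Scala-side class never loads, which is exactly why `_new_java_obj` falls through to a bare `JavaPackage` and raises `TypeError: 'JavaPackage' object is not callable`.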
Upvotes: 9
Reputation: 1242
Instead of using XGBoost, you can try LightGBM, which is a similar and arguably better (at least faster) algorithm. It works pretty much out of the box in PySpark; you can read more here
Upvotes: 3
Reputation: 8172
The newer Apache Spark (2.3.0) version does not have XGBoost. You should try with PySpark. You must convert your Spark dataframe to a pandas dataframe.
This is an excellent article that gives a workflow and explanation of xgboost and spark
OK, I read your post again and you claim that the dataset is too large. Maybe you should try Apache Arrow. Check this: Speeding up Pyspark with Apache Arrow
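If you do go the `toPandas()` route, note that Arrow-based transfer is off by default in Spark 2.3 and must be enabled explicitly, either with `spark.conf.set("spark.sql.execution.arrow.enabled", "true")` inside the session or at submit time. A sketch of the submit-time form (the script name `run.py` is a placeholder taken from the traceback above):

```shell
# Enable Arrow-accelerated columnar transfer for toPandas() (Spark >= 2.3)
spark-submit --conf spark.sql.execution.arrow.enabled=true run.py
```

Arrow speeds up the Spark-to-pandas conversion considerably, but the resulting pandas dataframe still has to fit in driver memory, so it does not by itself solve an OutOfMemoryException on a dataset that is too large for a single machine.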
Upvotes: -1