Reputation: 111
I am trying to use xgboost in Python on a classification problem, where I have the data in a numpy matrix X (rows = observations, columns = features) and the labels in a numpy array y. Because my data are sparse, I would like to run it on a sparse version of X, but it seems I am missing something, as an error occurs.
Here is what I do:
# Library import
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from scipy.sparse import csr_matrix
# Converting to sparse data and running xgboost
X_csr = csr_matrix(X)
xgb1 = XGBClassifier()
xgtrain = xgb.DMatrix(X_csr, label = y ) #to work with the xgb format
xgtest = xgb.DMatrix(Xtest_csr)
xgb1.fit(xgtrain, y, eval_metric='auc')
dtrain_predictions = xgb1.predict(xgtest)
etc...
Now I get an error when trying to fit the classifier:
File ".../xgboost/python-package/xgboost/sklearn.py", line 432, in fit
self._features_count = X.shape[1]
AttributeError: 'DMatrix' object has no attribute 'shape'
I have looked for a while into where this could come from, and I believe it has to do with the sparse format I wish to use. But what the problem is, and how I could fix it, I have no clue.
I would welcome any help or comments! Thank you very much.
Upvotes: 9
Views: 18624
Reputation: 2260
I prefer to use the XGBoost training wrapper as opposed to the XGBoost sklearn wrapper. You can create a classifier as follows:
params = {
    # I'm assuming you are doing binary classification
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    # any other training params here
    # full parameter list here https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
}
booster = xgb.train(params, xgtrain)
Note that xgb.train does not take a metrics argument; evaluation metrics go into the params dict (as eval_metric above), while the metrics keyword belongs to xgb.cv.
This API also has a built-in cross-validation function, xgb.cv, which works much better with XGBoost.
https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.cv https://xgboost.readthedocs.io/en/stable/python/examples/cross_validation.html
Tons more examples here https://github.com/dmlc/xgboost/tree/master/demo/guide-python
Hope this helps.
Upvotes: 1
Reputation: 91
You are using the xgboost scikit-learn API (http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn), so you don't need to convert your data to a DMatrix to fit the XGBClassifier(). Just removing the line
xgtrain = xgb.DMatrix(X_csr, label = y )
should work:
type(X_csr) #scipy.sparse.csr.csr_matrix
type(y) #numpy.ndarray
xgb1 = xgb.XGBClassifier()
xgb1.fit(X_csr, y)
which outputs:
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
objective='binary:logistic', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=1)
Upvotes: 8
Reputation: 231425
X_csr = csr_matrix(X) has many of the same properties as X, including .shape. But it is not a subclass, and not a drop-in replacement: code that consumes it needs to be 'sparse-aware'. sklearn qualifies; in fact it adds a number of its own fast sparse utility functions. But I don't know how well xgb handles sparse matrices, nor how it plays with sklearn.
Assuming the problem is with xgtrain, you need to look at its type and properties. How does it compare with the one made with xgb.DMatrix(X, label=y)?
If you want help from someone who isn't already an xgboost user, you'll have to provide a lot more information about the objects in your code.
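To illustrate the point about shared properties versus subclassing, a minimal check:

```python
import numpy as np
from scipy.sparse import csr_matrix

X = np.array([[0.0, 1.0], [2.0, 0.0]])
X_csr = csr_matrix(X)

# The sparse matrix exposes ndarray-like attributes such as .shape ...
print(X_csr.shape)                    # (2, 2)
# ... but it is not an ndarray subclass, so consuming code must be sparse-aware
print(isinstance(X_csr, np.ndarray))  # False
```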
Upvotes: 0