Reputation: 111

XGBoost and sparse matrix

I am trying to use xgboost from Python on a classification problem. The data are in a numpy matrix X (rows = observations, columns = features) and the labels are in a numpy array y. Because my data are sparse, I would like to run it on a sparse version of X, but I seem to be missing something, as an error occurs.

Here is what I do:

# Library import

import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from scipy.sparse import csr_matrix

# Converting to sparse data and running xgboost

X_csr = csr_matrix(X)
xgb1 = XGBClassifier()
xgtrain = xgb.DMatrix(X_csr, label = y )      #to work with the xgb format
xgtest = xgb.DMatrix(Xtest_csr)
xgb1.fit(xgtrain, y, eval_metric='auc')
dtrain_predictions = xgb1.predict(xgtest)

etc...

Now I get an error when trying to fit the classifier:

File ".../xgboost/python-package/xgboost/sklearn.py", line 432, in fit
self._features_count = X.shape[1]

AttributeError: 'DMatrix' object has no attribute 'shape'

I have looked for a while at where this could come from, and I believe it has to do with the sparse format I wish to use. But I have no clue what the cause is or how to fix it.

I would welcome any help or comments! Thank you very much.

Upvotes: 9

Answers (4)

volker238

Reputation: 2260

I prefer to use the XGBoost training wrapper as opposed to the XGBoost sklearn wrapper. You can create a classifier as follows:

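params = {
    # I'm assuming you are doing binary classification
    'objective': 'binary:logistic',
    # the evaluation metric goes in params; xgb.train has no metrics argument
    'eval_metric': 'auc',
    # any other training params here
    # full parameter list here https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
}
# xgtrain is the DMatrix built from the sparse matrix in the question
booster = xgb.train(params, xgtrain)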

This API also has built-in cross-validation, xgb.cv, which works much better with XGBoost.

https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.cv
https://xgboost.readthedocs.io/en/stable/python/examples/cross_validation.html

Tons more examples here https://github.com/dmlc/xgboost/tree/master/demo/guide-python

Hope this helps.

Upvotes: 1

Amey Laddad

Reputation: 11

The problem occurs because DMatrix.num_col() only returns the number of non-zero columns in a sparse matrix.
Convert the matrix to Compressed Sparse Column (CSC) format using scipy.sparse.coo_matrix.tocsc (csr_matrix has a tocsc method as well).
You can refer to http://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543

Upvotes: 1

A.A.

Reputation: 91

You are using the xgboost scikit-learn API (http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn), so you don't need to convert your data to a DMatrix to fit XGBClassifier. Removing the line

xgtrain = xgb.DMatrix(X_csr, label = y )

and passing X_csr directly to fit should work:

type(X_csr) #scipy.sparse.csr.csr_matrix
type(y) #numpy.ndarray
xgb1 = xgb.XGBClassifier()
xgb1.fit(X_csr, y)

which outputs:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
   gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
   min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
   objective='binary:logistic', reg_alpha=0, reg_lambda=1,
   scale_pos_weight=1, seed=0, silent=True, subsample=1)

Upvotes: 8

hpaulj

Reputation: 231425

X_csr = csr_matrix(X) has many of the same properties as X, including .shape. But it is not a subclass, and not a drop-in replacement. The code needs to be 'sparse-aware'. sklearn qualifies; in fact it adds a number of its own fast sparse utility functions.

But I don't know how well xgboost handles sparse matrices, nor how it plays with sklearn.

Assuming the problem is with xgtrain, you need to look at its type and properties. How does it compare with the one made with xgb.DMatrix(X, label = y )?

If you want help from someone who isn't already an xgboost user, you'll have to provide a lot more information about the objects in your code.

Upvotes: 0
