Gokul Y

Reputation: 115

lightgbm || ValueError: Series.dtypes must be int, float or bool

The DataFrame's NA values have already been filled.

The dataset's schema has no object dtypes, as specified in the documentation.

df.info() 

output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 429 entries, 351 to 559
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   Gender             429 non-null    category
 1   Married            429 non-null    category
 2   Dependents         429 non-null    category
 3   Education          429 non-null    category
 4   Self_Employed      429 non-null    category
 5   ApplicantIncome    429 non-null    int64   
 6   CoapplicantIncome  429 non-null    float64 
 7   LoanAmount         429 non-null    float64 
 8   Loan_Amount_Term   429 non-null    float64 
 9   Credit_History     429 non-null    float64 
 10  Property_Area      429 non-null    category
dtypes: category(6), float64(4), int64(1)
memory usage: 23.3 KB

I have the following code:


import lightgbm as lgb

train_data = lgb.Dataset(x_train, label=y_train, categorical_feature=cat_cols)

# define parameters
params = {'learning_rate': 0.001}

model = lgb.train(params, train_data, 100, categorical_feature=cat_cols)


I am getting the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-178-aaa91a2d8719> in <module>
      6 
      7 
----> 8 model= lgb.train(params, train_data, 100,categorical_feature=cat_cols)

~\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    229     # construct booster
    230     try:
--> 231         booster = Booster(params=params, train_set=train_set)
    232         if is_valid_contain_train:
    233             booster.set_train_data_name(train_data_name)

~\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, model_str, silent)
   1981                     break
   1982             # construct booster object
-> 1983             train_set.construct()
   1984             # copy the parameters from train_set
   1985             params.update(train_set.get_params())

~\Anaconda3\lib\site-packages\lightgbm\basic.py in construct(self)
   1319             else:
   1320                 # create train
-> 1321                 self._lazy_init(self.data, label=self.label,
   1322                                 weight=self.weight, group=self.group,
   1323                                 init_score=self.init_score, predictor=self._predictor,

~\Anaconda3\lib\site-packages\lightgbm\basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1133                 raise TypeError('Cannot initialize Dataset from {}'.format(type(data).__name__))
   1134         if label is not None:
-> 1135             self.set_label(label)
   1136         if self.get_label() is None:
   1137             raise ValueError("Label should not be None")

~\Anaconda3\lib\site-packages\lightgbm\basic.py in set_label(self, label)
   1648         self.label = label
   1649         if self.handle is not None:
-> 1650             label = list_to_1d_numpy(_label_from_pandas(label), name='label')
   1651             self.set_field('label', label)
   1652             self.label = self.get_field('label')  # original values can be modified at cpp side

~\Anaconda3\lib\site-packages\lightgbm\basic.py in list_to_1d_numpy(data, dtype, name)
     88     elif isinstance(data, Series):
     89         if _get_bad_pandas_dtypes([data.dtypes]):
---> 90             raise ValueError('Series.dtypes must be int, float or bool')
     91         return np.array(data, dtype=dtype, copy=False)  # SparseArray should be supported as well
     92     else:

ValueError: Series.dtypes must be int, float or bool

Upvotes: 8

Views: 14271

Answers (2)

Vojtech Stas

Reputation: 751

I had the same problem. My y_train was in int64 dtype. This solved my problem:

model_LGB.fit(
    X=X_train,
    y=y_train.astype('int32'))
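
The question uses the native lgb.train API rather than the scikit-learn wrapper; the equivalent fix there is to cast the label to a numeric dtype before constructing the Dataset. A minimal sketch, assuming x_train, y_train and cat_cols are the same objects as in the question:

import lightgbm as lgb

# the error comes from the label Series: it must be int, float or bool,
# so cast it before building the Dataset
train_data = lgb.Dataset(x_train,
                         label=y_train.astype('int32'),
                         categorical_feature=cat_cols)

params = {'learning_rate': 0.001}
model = lgb.train(params, train_data, 100)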

Upvotes: 1

Patrick Bormann

Reputation: 749

Did anyone help you yet? If not: the answer lies in transforming your variables.

Go to this link: GitHub Discussion lightGBM

The creators of LightGBM were confronted with that same question once. In the link above they (STRIKER) tell you that you should transform your variables with astype("category") (pandas/scikit) AND label-encode them, because you need an INT value in your feature columns, specifically an int32.

However, label encoding and astype('category') should normally do the same thing: Encoding
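
As a quick illustration of that equivalence, here is a minimal sketch on made-up string data (not from the question):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['Rural', 'Urban', 'Semiurban', 'Urban'])

# LabelEncoder assigns integer codes in sorted order of the labels ...
le_codes = LabelEncoder().fit_transform(s)

# ... and so do the category codes of a categorical Series
cat_codes = s.astype('category').cat.codes

print(le_codes)              # [0 2 1 2]
print(cat_codes.to_numpy())  # [0 2 1 2]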

Another useful link is this advanced doc about the categorical feature: Categorical feature light gbm homepage, where they tell you that LightGBM can't deal with object (string) dtypes as in your data.

If you are still feeling uncomfortable with this explanation, here is my code snippet from the Kaggle space_race_set. If you are still having problems, just ask away.

import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

cat_feats = ['Company Name', 'Night_and_Day', 'Rocket Type', 'Rocket Mission Type', 'State', 'Country']
labelencoder = LabelEncoder()

# encode every categorical feature into integer codes
for col in cat_feats:
    train_df[col] = labelencoder.fit_transform(train_df[col])

# make sure the encoded columns really are int
for col in cat_feats:
    train_df[col] = train_df[col].astype('int')

y = train_df[["Status Mission"]]
X = train_df.drop(["Status Mission"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_data = lgb.Dataset(X_train,
                         label=y_train,
                         categorical_feature=cat_feats,
                         free_raw_data=False)
test_data = lgb.Dataset(X_test,
                        label=y_test,
                        categorical_feature=cat_feats,
                        free_raw_data=False)
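
Training then looks just like in the question. A minimal sketch, reusing the learning-rate-only params from the original post and assuming the "Status Mission" label has also been encoded to integers:

params = {'learning_rate': 0.001}

# train for 100 boosting rounds and evaluate on the held-out Dataset
model = lgb.train(params, train_data, 100, valid_sets=[test_data])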
 

Upvotes: 14
