Reputation: 480
I have a dataframe named data with the following properties:
[880 rows x 10 columns]
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 880 entries, (123, 456) to (789, 890)
Data columns (total 10 columns):
Date_Diff           880 non-null float64
Response            880 non-null category
Len1                880 non-null int64
Type1               877 non-null category
Len2                880 non-null int64
Type2               880 non-null category
Len_Diff            880 non-null int64
Same_Institution    880 non-null category
Same_Type           880 non-null category
Score               880 non-null float64
dtypes: category(5), float64(2), int64(3)
memory usage: 82.0+ KB
None
Note: The indices on the dataframe are string columns called ID1 and ID2. This is how I set the multiindex: data = data.set_index(['ID1','ID2'], drop = True). Since drop = True, you won't see them in the above dataframe.
I am trying to encode the categorical variables Type1 and Type2 using LabelEncoder and OneHotEncoder. This is my code:
# Encoding function
def encode(data):
    global cat_columns
    cat_columns = list(data.select_dtypes(include=['category','object']))
    le = LabelEncoder()
    ohe = OneHotEncoder(categorical_features = cat_columns)
    for col in cat_columns:
        data[col] = le.fit_transform(data[col])
    data = ohe.fit_transform(data)
    return data

# Use encoding function
encode(data)
I get an IndexError when I run this code. The error is:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-xxx> in <module>()
14 return data
15
---> 16 encode(data)
<ipython-input-xxx> in encode(data)
---> 13 data = ohe.fit_transform(data)
14 return data
15
/Users/username/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
1900 """
1901 return _transform_selected(X, self._fit_transform,
-> 1902 self.categorical_features, copy=True)
1903
1904 def _transform(self, X):
/Users/username/anaconda2/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
1706 ind = np.arange(n_features)
1707 sel = np.zeros(n_features, dtype=bool)
-> 1708 sel[np.asarray(selected)] = True
1709 not_sel = np.logical_not(sel)
1710 n_selected = np.sum(sel)
IndexError: arrays used as indices must be of integer (or boolean) type
What is causing this error?
I also tried it without setting the IDs as indices, but it still throws the same error.
EDIT: Adding a subset of the dataframe here as an HTML table.
Some of the columns' data types have been changed since. The data types are updated in the dataframe properties above.
Response is the target variable and is categorical.
Same_Institution and Same_Type have been changed from integers to categorical binary variables.
Type1 and Type2 have been changed from pandas objects to categories.
<table><tbody>
<tr><th>ID1</th><th>ID2</th><th>Len1</th><th>Type1</th><th>Len2</th><th>Type2</th><th>Len_Diff</th><th>Date_Diff</th><th>Same_Institution</th><th>Same_Type</th><th>Score</th><th>Response</th></tr>
<tr><td>121</td><td>977</td><td>10185</td><td>PR</td><td>10185</td><td>MR</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>214</td><td>753</td><td>5039</td><td>MR</td><td>4926</td><td>MR</td><td>113</td><td>9.266666667</td><td>0</td><td>1</td><td>0.997031978</td><td>1</td></tr>
<tr><td>378</td><td>919</td><td>45404</td><td>PR</td><td>45404</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>283</td><td>685</td><td>821076</td><td>40-F</td><td>412353</td><td>AR</td><td>408723</td><td>0.35</td><td>0</td><td>0</td><td>0.888266653</td><td>0</td></tr>
<tr><td>452</td><td>837</td><td>16343</td><td>PR</td><td>16343</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>333</td><td>726</td><td>22204</td><td>PR</td><td>20897</td><td>6-K</td><td>1307</td><td>11.3</td><td>0</td><td>0</td><td>0.99251128</td><td>1</td></tr>
<tr><td>107</td><td>960</td><td>9781</td><td>6-K</td><td>6073</td><td>MR</td><td>3708</td><td>0.483333333</td><td>0</td><td>0</td><td>0.933646747</td><td>0</td></tr>
<tr><td>236</td><td>768</td><td>3375</td><td>PR</td><td>2945</td><td>MR</td><td>430</td><td>46.58333333</td><td>0</td><td>0</td><td>0.239269675</td><td>0</td></tr>
<tr><td>419</td><td>829</td><td>81247</td><td>MR</td><td>81247</td><td>MR</td><td>0</td><td>0.016666667</td><td>0</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>184</td><td>991</td><td>51474</td><td>PR</td><td>51474</td><td>ER</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>217</td><td>868</td><td>23714</td><td>ER</td><td>26633</td><td>8-K</td><td>2919</td><td>1.716666667</td><td>0</td><td>0</td><td>0.980611207</td><td>1</td></tr>
<tr><td>202</td><td>622</td><td>4638</td><td>MR</td><td>4638</td><td>PR</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>
<tr><td>308</td><td>883</td><td>73476</td><td>ER</td><td>404584</td><td>6-K</td><td>331108</td><td>12.58333333</td><td>0</td><td>0</td><td>0.825482503</td><td>0</td></tr>
<tr><td>186</td><td>880</td><td>291279</td><td>FIN SUPP</td><td>320893</td><td>6-K</td><td>29614</td><td>4.483333333</td><td>0</td><td>0</td><td>0.991668299</td><td>1</td></tr>
<tr><td>305</td><td>896</td><td>22988</td><td>PR</td><td>28554</td><td>6-K</td><td>5566</td><td>22.1</td><td>0</td><td>0</td><td>0.941192693</td><td>0</td></tr>
</tbody></table>
Upvotes: 2
Views: 3243
Reputation: 349
I was encountering the exact same error with OneHotEncoder.
The core issue is that the categorical_features parameter doesn't accept column names; it expects integer positions or a boolean mask. From the OneHotEncoder documentation:
categorical_features : "all" or array of indices or mask
Specify what features are treated as categorical.
- 'all' (default): All features are treated as categorical.
- array of indices: Array of categorical feature indices.
- mask: Array of length n_features and with dtype=bool.
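To see why column names trip this up, here is a minimal sketch that reproduces only the failing line from the traceback (sel[np.asarray(selected)] = True) with plain NumPy. The positions 3 and 5 are an assumption based on the column order in the dataframe properties shown in the question (Type1 is the fourth column, Type2 the sixth).

```python
import numpy as np

n_features = 10

# What the question effectively passed: column names (strings).
columns_by_name = ["Type1", "Type2"]
sel = np.zeros(n_features, dtype=bool)
try:
    sel[np.asarray(columns_by_name)] = True
    raised = False
except IndexError:
    # NumPy refuses to index with an array of strings.
    raised = True

# Integer positions are accepted by the same indexing expression.
# Positions 3 and 5 are assumed from the column order in the question.
sel_by_index = np.zeros(n_features, dtype=bool)
sel_by_index[np.asarray([3, 5])] = True
```

So the encoder never gets as far as looking at the data; the failure happens while it builds its internal selection mask.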
What worked for me was to generate a boolean mask first using a snippet like:
cat_columns = list(data.select_dtypes(include=['category','object']))
column_mask = []
for column_name in list(data.columns.values):
    column_mask.append(column_name in cat_columns)
# And then pass the column_mask into the OneHotEncoder
ohe = OneHotEncoder(categorical_features = column_mask)
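As a side note, pandas can probably build the same mask (or the integer positions) without an explicit loop. This is only a sketch on a small made-up frame; df and its values are hypothetical stand-ins for the real data, with a few columns borrowed from the question's table.

```python
import pandas as pd

# Hypothetical miniature of the question's dataframe.
df = pd.DataFrame({
    "Len1": [10185, 5039, 45404],
    "Type1": pd.Categorical(["PR", "MR", "PR"]),
    "Len2": [10185, 4926, 45404],
    "Type2": pd.Categorical(["MR", "MR", "PR"]),
})

# Names of the categorical columns, as in the loop-based version.
cat_columns = list(df.select_dtypes(include=["category"]))

# Boolean mask with one entry per column, in column order.
column_mask = df.columns.isin(cat_columns)

# Alternatively, the integer positions of those columns.
column_positions = df.columns.get_indexer(cat_columns)
```

Either column_mask or column_positions matches one of the forms the documentation excerpt lists (mask or array of indices).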
So your original function would be:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encoding function
def encode(data):
    global cat_columns
    cat_columns = list(data.select_dtypes(include=['category','object']))
    # Boolean mask: True for each categorical column, in column order
    column_mask = []
    for column_name in list(data.columns.values):
        column_mask.append(column_name in cat_columns)
    le = LabelEncoder()
    ohe = OneHotEncoder(categorical_features = column_mask)
    # Convert each categorical column to integer labels first
    for col in cat_columns:
        data[col] = le.fit_transform(data[col])
    data = ohe.fit_transform(data)
    return data

# Use encoding function
encode(data)
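One caveat: as far as I know, the categorical_features parameter was deprecated and then removed in later scikit-learn releases, so the function above is tied to older versions. If you only need the one-hot columns, a different technique that sidesteps the mask entirely is pandas.get_dummies. The sketch below uses a small hypothetical frame (df is made up, not the real data) and is an alternative, not a drop-in replacement for the function above.

```python
import pandas as pd

# Hypothetical miniature of the question's dataframe.
df = pd.DataFrame({
    "Len1": [10185, 5039, 45404],
    "Type1": pd.Categorical(["PR", "MR", "PR"]),
    "Len2": [10185, 4926, 45404],
    "Type2": pd.Categorical(["MR", "MR", "PR"]),
})

# One-hot encode only the named columns; the others pass through unchanged.
# Here get_dummies is given column names directly, so no mask is needed.
encoded = pd.get_dummies(df, columns=["Type1", "Type2"])
```

Unlike OneHotEncoder, this returns a DataFrame with readable column names such as Type1_PR, which makes it easier to check which category ended up where.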
Upvotes: 2