A. R.

Reputation: 49

Keras Index out of Bounds Error for csv database

This is my first time posting on StackOverflow after using the site for quite some time.

I have been trying to predict the last column of a practice machine learning dataset from this link: http://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#

I ran the code below and received this error:

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('/Users/ric4711/diabetes_tensorflow', wdir='/Users/ric4711')

  File "/Users/ric4711/anaconda/lib/python2.7/site-packages/spyder/utils/site/sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "/Users/ric4711/anaconda/lib/python2.7/site-packages/spyder/utils/site/sitecustomize.py", line 94, in execfile
    builtins.execfile(filename, *where)

  File "/Users/ric4711/diabetes_tensorflow", line 60, in <module>
    y_train = to_categorical(y_train, num_classes = num_classes)

  File "/Users/ric4711/anaconda/lib/python2.7/site-packages/keras/utils/np_utils.py", line 25, in to_categorical
    categorical[np.arange(n), y] = 1

IndexError: index 3 is out of bounds for axis 1 with size 3

I suspect the problem is in the dimensions of y or in how I am encoding its categories. Any help would be greatly appreciated.

from pandas import read_csv
import numpy
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras.layers import Dense, Input
from keras.models import Model

dataset = read_csv(r"/Users/ric4711/Documents/dataset_diabetes/diabetic_data.csv", header=None)
#Column 2, 5, 10, 11, 18, 19, 20 all have "?" 
#(101767, 50) size of dataset
#PROBLEM COLUMNS WITH NUMBER OF "?"
#2      2273
#5     98569
#10    40256
#11    49949
#18       21
#19      358
#20     1423
le=LabelEncoder()

dataset[[2,5,10,11,18,19,20]] = dataset[[2,5,10,11,18,19,20]].replace("?", numpy.NaN)

dataset = dataset.drop(dataset.columns[[0, 1, 5, 10, 11]], axis=1)
dataset.dropna(inplace=True)


y = dataset[[49]]
X = dataset.drop(dataset.columns[[44]], 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

for col in X_test.columns.values:
    if X_test[col].dtypes=='object':
        data=X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col]=le.transform(X_train[col])
        X_test[col]=le.transform(X_test[col])
        
for col in y_test.columns.values:
    if y_test[col].dtypes=='object':
        data=y_train[col].append(y_test[col])
        le.fit(data.values)
        y_train[col]=le.transform(y_train[col])
        y_test[col]=le.transform(y_test[col])
        

batch_size = 500
num_epochs = 300
hidden_size = 250

num_test = X_test.shape[0]
num_training = X_train.shape[0]
height, width, depth = 1, X_train.shape[1], 1
num_classes = 3

y_train = y_train.as_matrix()
y_test = y_test.as_matrix()

y_train = to_categorical(y_train, num_classes = num_classes)
y_test = to_categorical(y_test, num_classes = num_classes)

inp = Input(shape=(height * width,))
hidden_1 = Dense(hidden_size, activation='tanh')(inp) 
hidden_2 = Dense(hidden_size, activation='tanh')(hidden_1)
hidden_3 = Dense(hidden_size, activation='tanh')(hidden_2)
hidden_4 = Dense(hidden_size, activation='tanh')(hidden_3)
hidden_5 = Dense(hidden_size, activation='tanh')(hidden_4)
hidden_6 = Dense(hidden_size, activation='tanh')(hidden_5)
hidden_7 = Dense(hidden_size, activation='tanh')(hidden_6)
hidden_8 = Dense(hidden_size, activation='tanh')(hidden_7)
hidden_9 = Dense(hidden_size, activation='tanh')(hidden_8)
hidden_10 = Dense(hidden_size, activation='tanh')(hidden_9)
hidden_11 = Dense(hidden_size, activation='tanh')(hidden_10)
out = Dense(num_classes, activation='softmax')(hidden_11) 


model = Model(inputs=inp, outputs=out) 

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy']) 


model.fit(X_train, y_train, batch_size = batch_size,epochs = num_epochs, validation_split = 0.1, shuffle = True)

model.evaluate(X_test, y_test, verbose=1) 
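
The failing step can be reproduced in isolation, which may help narrow things down. The sketch below is only an illustration: `one_hot` is a hypothetical helper that mimics the indexing line shown in the traceback (`categorical[np.arange(n), y] = 1`), not the actual Keras implementation, and the label values are toy data rather than the diabetes dataset.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Hypothetical stand-in for the indexing step in the traceback."""
    labels = np.asarray(labels, dtype=int).ravel()
    n = labels.shape[0]
    categorical = np.zeros((n, num_classes))
    # Each label is used as a column index, so every label
    # must be strictly smaller than num_classes.
    categorical[np.arange(n), labels] = 1
    return categorical


# Toy labels with four distinct encoded classes (0, 1, 2, 3).
labels = np.array([0, 1, 2, 3])

# Asking for only 3 columns fails, because label 3 has no column.
error_message = None
try:
    one_hot(labels, num_classes=3)
except IndexError as exc:
    error_message = str(exc)

# Sizing the encoding from the labels themselves avoids the mismatch.
num_classes = int(labels.max()) + 1
encoded = one_hot(labels, num_classes)
```

In other words, an `IndexError` of this kind suggests the encoded labels contain at least one more distinct value than `num_classes` allows for.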

Upvotes: 2

Views: 3059

Answers (1)

A. R.

Reputation: 49

I fixed this by changing num_classes to 4 and, in the .fit method, passing numpy.array(X_train) and numpy.array(y_train) instead of the DataFrames.
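
One possible reason a fourth class shows up, offered as a guess rather than a confirmed diagnosis: if the CSV begins with a header row, `read_csv(..., header=None)` keeps that row as data, so the column name itself is encoded as an extra label. The sketch below uses a tiny made-up CSV; the column names `age` and `readmitted` and all values are assumptions for illustration, not taken from the question.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for the real file: one header row, three label values.
csv_text = "age,readmitted\n50,NO\n60,>30\n70,<30\n"

# header=None treats the first line as an ordinary data row.
with_header_as_data = pd.read_csv(StringIO(csv_text), header=None)
# The default (header=0) consumes the first line as column names.
with_header_parsed = pd.read_csv(StringIO(csv_text))

# With header=None the label column is addressed by position (1).
labels_raw = sorted(with_header_as_data[1].unique())
labels_clean = sorted(with_header_parsed["readmitted"].unique())

# Counting distinct labels gives a num_classes that matches the data.
num_classes_raw = len(labels_raw)      # includes the stray header string
num_classes_clean = len(labels_clean)  # only the real label values
```

If that is what happened here, setting num_classes to 4 makes the error go away but leaves the header row in the data as its own class; reading the file with the header parsed, or computing num_classes from the distinct labels, might be the cleaner option.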

Upvotes: 2
