Reputation: 3477
I want to train a classifier with scikit-learn, but first I need to load the corresponding data. I am using the data file available at:
https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/
When I open it in Word, it has the following contents:
ADT1_YEAST 0.58 0.61 0.47 0.13 0.50 0.00 0.48 0.22 MIT
ADT2_YEAST 0.43 0.67 0.48 0.27 0.50 0.00 0.53 0.22 MIT
ADT3_YEAST 0.64 0.62 0.49 0.15 0.50 0.00 0.53 0.22 MIT
AAR2_YEAST 0.58 0.44 0.57 0.13 0.50 0.00 0.54 0.22 NUC
Each field is separated by a double space, and each line ends with a carriage return.
I want to read it with the following command:
f=open("yeast.data")
data = np.loadtxt(f,delimiter=" ")
and at the end I want to be able to use the following:
X = data[:,:-1] # select all columns except the last
y = data[:, -1] # select the last column
so that I can call:
X_train, X_test, y_train, y_test = train_test_split(X, y)
but when I try to read it the following error appears:
ValueError: could not convert string to float: ADT1_YEAST
So how can I read this file in Python so that I can later use it with MLPClassifier?
Thanks
Upvotes: 0
Views: 101
Reputation: 51335
You can skip the f=open(...) call and pass the filename directly, and you can use dtype='O' so that numpy reads the file as a mix of numbers and strings. Because of some inconsistencies in the data structure of the file you linked, it's best to use genfromtxt instead of loadtxt:
data = np.genfromtxt('yeast.data',dtype='O')
>>> data
array([[b'ADT1_YEAST', b'0.58', b'0.61', ..., b'0.48', b'0.22', b'MIT'],
[b'ADT2_YEAST', b'0.43', b'0.67', ..., b'0.53', b'0.22', b'MIT'],
[b'ADT3_YEAST', b'0.64', b'0.62', ..., b'0.53', b'0.22', b'MIT'],
...,
[b'ZNRP_YEAST', b'0.67', b'0.57', ..., b'0.56', b'0.22', b'ME2'],
[b'ZUO1_YEAST', b'0.43', b'0.40', ..., b'0.53', b'0.39', b'NUC'],
[b'G6PD_YEAST', b'0.65', b'0.54', ..., b'0.53', b'0.22', b'CYT']], dtype=object)
>>> data.shape
(1484, 10)
You can set the dtypes when you call genfromtxt (see the documentation), or you can convert them manually afterwards like this:
data[:,0] = data[:,0].astype(str)
data[:,1:-1]= data[:,1:-1].astype(float)
data[:,-1] = data[:,-1].astype(str)
>>> data
array([['ADT1_YEAST', 0.58, 0.61, ..., 0.48, 0.22, 'MIT'],
['ADT2_YEAST', 0.43, 0.67, ..., 0.53, 0.22, 'MIT'],
['ADT3_YEAST', 0.64, 0.62, ..., 0.53, 0.22, 'MIT'],
...,
['ZNRP_YEAST', 0.67, 0.57, ..., 0.56, 0.22, 'ME2'],
['ZUO1_YEAST', 0.43, 0.4, ..., 0.53, 0.39, 'NUC'],
['G6PD_YEAST', 0.65, 0.54, ..., 0.53, 0.22, 'CYT']], dtype=object)
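If the goal is only to get X and y for train_test_split, one possible alternative to converting the object array is to read the numeric columns and the label column separately with the usecols argument of genfromtxt. This is a minimal sketch, not tested against the full file: it assumes every row has a name in column 0, eight numeric features in columns 1 to 8 and the class label in column 9, and it uses the four rows from the question as a stand-in (the variable name sample is only for illustration).

```python
from io import StringIO

import numpy as np

# Four sample rows copied from the question. In practice, pass the path
# 'yeast.data' to genfromtxt instead of a StringIO object.
sample = (
    "ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT\n"
    "ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT\n"
    "ADT3_YEAST  0.64  0.62  0.49  0.15  0.50  0.00  0.53  0.22  MIT\n"
    "AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC\n"
)

# Columns 1..8 are assumed to hold the numeric features. With no delimiter
# given, genfromtxt splits on any run of whitespace and parses floats.
X = np.genfromtxt(StringIO(sample), usecols=range(1, 9))

# Column 9 is assumed to hold the class label, so read it as text.
y = np.genfromtxt(StringIO(sample), usecols=9, dtype=str)

print(X.shape)
print(y)
```

Read this way, X is a plain float array and y is an array of label strings, so both can be passed to train_test_split(X, y) without the protein names in column 0 getting in the way.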
Upvotes: 1