Chris Rigano

Reputation: 701

Problems using genfromtxt to input into scikit-learn fit function

I am trying to use genfromtxt to read in a CSV file and then use RandomForestClassifier. I wind up having to call genfromtxt twice: once to read in the feature names and once more to get the data in the proper format. The code for this attempt follows:

import csv
import numpy as np

data = np.genfromtxt('plants.csv',dtype=float, delimiter=',', names=True)
feature_names = np.array(data.dtype.names)
feature_names = feature_names[[ 0,1,2,3,4]] 

data = np.genfromtxt('plants.csv',dtype=float, delimiter=',', skip_header=1)
plants_X = data[:, [0,1,2,3,4]] 
plants_y = np.ravel(data[:,[5]]) #Return a flattened array required by scikit-learn fit for 2nd argument

from sklearn.ensemble import RandomForestClassifier 
clf = RandomForestClassifier( n_estimators = 10, random_state = 33)
clf = clf.fit(plants_X, plants_y)

print feature_names, '\n', clf.feature_importances_

When I use genfromtxt with the names=True option, the "data" array that is read in is not in the format I expected:

" ([(31.194181, 0.0, 0.0, 0.0, 1.0, 1.0), (12.0, 0.0, 0.0, 1.0, 0.0, 1.0), (18.0, 1.0, 0.0, 1.0, 0.0, 0.0), (31.194181, 0.0, 0.0, 0.0, 1.0, 0.0)], ... dtype=[('A', '

I want to get the feature names from the file without reading it twice!

Thanks for your assistance!

PS: Thanks to "Cyborg" I got this far!

Upvotes: 0

Views: 279

Answers (1)

Andreas Mueller

Reputation: 28788

I recommend using pandas for this. You can use pandas.read_csv to get a pandas DataFrame with column names. You need to convert the data to a NumPy array to pass it to scikit-learn, though.
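A minimal sketch of that approach, reading the data only once. The inline sample stands in for plants.csv so the snippet runs on its own; the column names other than 'A' (B to E and "label") are assumptions, as is the helper name load_features. With the real file you would pass 'plants.csv' to the helper instead of the StringIO object.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for plants.csv: the header names B..E and "label"
# are assumed; the rows mirror the values shown in the question.
SAMPLE_CSV = """A,B,C,D,E,label
31.194181,0.0,0.0,0.0,1.0,1.0
12.0,0.0,0.0,1.0,0.0,1.0
18.0,1.0,0.0,1.0,0.0,0.0
31.194181,0.0,0.0,0.0,1.0,0.0
"""


def load_features(source):
    """Read the CSV once and return (feature_names, X, y) as NumPy arrays."""
    df = pd.read_csv(source)  # the header row becomes the column names
    feature_names = df.columns[:5].to_numpy()
    X = df.iloc[:, :5].to_numpy(dtype=float)  # first five columns: features
    y = df.iloc[:, 5].to_numpy()              # sixth column: 1-D target
    return feature_names, X, y


feature_names, plants_X, plants_y = load_features(io.StringIO(SAMPLE_CSV))

clf = RandomForestClassifier(n_estimators=10, random_state=33)
clf.fit(plants_X, plants_y)

print(feature_names)
print(clf.feature_importances_)
```

Selecting the target with df.iloc[:, 5] already yields a one-dimensional array, so the np.ravel step from the question is not needed here.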

Upvotes: 2
