Reputation: 701
I am trying to use genfromtxt to read in a CSV file and then use RandomForestClassifier. I wind up having to use genfromtxt twice: once to read in the feature names and then again to get the data in the proper format. The code for this attempt follows:
import csv
import numpy as np
data = np.genfromtxt('plants.csv',dtype=float, delimiter=',', names=True)
feature_names = np.array(data.dtype.names)
feature_names = feature_names[[ 0,1,2,3,4]]
data = np.genfromtxt('plants.csv',dtype=float, delimiter=',', skip_header=1)
plants_X = data[:, [0,1,2,3,4]]
plants_y = np.ravel(data[:,[5]]) # Return a flattened array, required by scikit-learn fit for its 2nd argument
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier( n_estimators = 10, random_state = 33)
clf = clf.fit(plants_X, plants_y)
print feature_names, '\n', clf.feature_importances_
When I use genfromtxt with the names=True option, the data read in is not in the format I expected:
" ([(31.194181, 0.0, 0.0, 0.0, 1.0, 1.0), (12.0, 0.0, 0.0, 1.0, 0.0, 1.0), (18.0, 1.0, 0.0, 1.0, 0.0, 0.0), (31.194181, 0.0, 0.0, 0.0, 1.0, 0.0)], ... dtype=[('A', '
I want to get the feature names from the file without reading it twice!
Thanks for your assistance!
PS: Thanks to "Cyborg" I got this far!
Upvotes: 0
Views: 279
Reputation: 28788
I recommend using pandas for this.
You can use pandas.read_csv to get a pandas DataFrame with column names. You need to convert the data to a numpy array to pass it to scikit-learn, though.
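A minimal sketch of what that could look like, assuming the file has a header row, five feature columns, and the label in the sixth column, as in the question. The inline CSV text and its column names are made up for illustration so the snippet runs on its own; with the real file you would pass 'plants.csv' to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for plants.csv: five feature columns plus a label column.
# The column names here are invented for the example.
csv_text = """A,B,C,D,E,label
31.194181,0.0,0.0,0.0,1.0,1.0
12.0,0.0,0.0,1.0,0.0,1.0
18.0,1.0,0.0,1.0,0.0,0.0
31.194181,0.0,0.0,0.0,1.0,0.0
"""

# The file is read only once; with a real file: pd.read_csv('plants.csv')
df = pd.read_csv(io.StringIO(csv_text))

feature_names = df.columns[:5].to_numpy()  # header names of the first five columns
plants_X = df.iloc[:, :5].to_numpy()       # 2-D array of feature values
plants_y = df.iloc[:, 5].to_numpy()        # 1-D array of labels, so no ravel is needed
```

plants_X and plants_y can then be passed to clf.fit exactly as in the question, and feature_names lines up with clf.feature_importances_.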
Upvotes: 2