MegaJiXiang

Reputation: 91

Python: How do I format my data so that scikit-learn will allow me to call the .fit(X, y) function?

I wouldn't ask you all for help, but I've been trying for many hours to figure out what I'm doing wrong, without success. I'm trying to train a neural network on some data I have collected, using the scikit-learn library in Python.

Website I'm using as reference: http://scikit-learn.org/stable/modules/neural_networks_supervised.html

My data for training_x ends up being an array of arrays which looks similar to this:

[[0.1, 0.2, -0.1], [0.21, -0.32, 0.3]]

My data for training_y is an array of floats which looks like this: [0.3, 0.2]

from datetime import timedelta
from sklearn.neural_network import MLPClassifier

# output_training_data appends feature rows to training_x
# and float targets to training_y.
training_x = []
training_y = []
for day_offset in range((end_date - start_date).days + 1):
    curr_day = start_date + timedelta(days=day_offset)
    for company in companies:
        output_training_data(cursor, training_x, training_y, company, curr_day)

clf = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(5, 3), random_state=1)
clf.fit(training_x, training_y)

Then I get the following error:

Traceback (most recent call last):
  File "/Users/jodymcadams/Documents/GitHub/moneygen/create_training_data.py", line 194, in <module>
    main()
  File "/Users/jodymcadams/Documents/GitHub/moneygen/create_training_data.py", line 191, in main
    update_data(app_config, companies)
  File "/Users/jodymcadams/Documents/GitHub/moneygen/create_training_data.py", line 169, in update_data
    update_tweets(app_config, companies)
  File "/Users/jodymcadams/Documents/GitHub/moneygen/create_training_data.py", line 154, in update_tweets
    process_twitter(cursor, companies)
  File "/Users/jodymcadams/Documents/GitHub/moneygen/create_training_data.py", line 136, in process_twitter
    clf.fit(training_x, training_y)
  File "/usr/local/lib/python2.7/site-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
    return self._fit(X, y, incremental=False)
  File "/usr/local/lib/python2.7/site-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
    X, y = self._validate_input(X, y, incremental)
  File "/usr/local/lib/python2.7/site-packages/sklearn/neural_network/multilayer_perceptron.py", line 908, in _validate_input
    self._label_binarizer.fit(y)
  File "/usr/local/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 304, in fit
    self.classes_ = unique_labels(y)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 98, in unique_labels
    raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([ -8.60708650e-04,  -1.63581100e-03,   9.93761387e-04,
         3.86313466e-04,   4.85415472e-04,   9.92796708e-05,
        -7.66657374e-04,  -1.60558464e-03,   2.50678922e-03,
        -9.75813759e-04,  -1.11646082e-03,  -2.30801511e-03,
        -1.48148148e-03,  -2.47524752e-03,   9.89119683e-04,
        -4.94804552e-04,   4.94559842e-04,  -9.90099010e-04,
         2.72479564e-03,  -2.36707939e-03,  -3.64298725e-04,
         1.36425648e-03,  -1.81933958e-04,  -5.12023407e-03,

Upvotes: 1

Views: 358

Answers (1)

Jacob Panikulam

Reputation: 1218

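For a classifier, your labels must be discrete classes, such as integers. scikit-learn treats non-integer float labels as continuous targets, so unique_labels cannot turn them into a set of classes and raises "Unknown label type".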
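To see how scikit-learn interprets a label array before fitting, here is a minimal sketch using type_of_target from sklearn.utils.multiclass. The label values are made up for illustration and are not taken from the question's data.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical label arrays, for illustration only.
float_labels = [0.3, 0.2, -0.05]   # non-integer floats, like training_y in the question
int_labels = [0, 1, 1]             # discrete class labels

# Non-integer floats are reported as a continuous target,
# which classifiers such as MLPClassifier reject.
float_kind = type_of_target(float_labels)
int_kind = type_of_target(int_labels)

print(float_kind)  # continuous
print(int_kind)    # binary
```

If the result is "continuous", the data is shaped for a regressor rather than a classifier.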

Think of "classification" as the task of finding a mapping from inputs to discrete outputs, and "regression" as the task of finding a mapping from inputs to continuous outputs. Since your labels are floats, it looks to me like you're trying to do regression.

If so, consider using MLPRegressor instead.
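A minimal sketch of that swap, assuming the data has the shape described in the question (one list of features per sample, one float per target). The numbers and the max_iter value are made up for illustration; the other hyperparameters are copied from the question.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy data in the same layout as the question: X is 2-D
# (n_samples, n_features), y is 1-D (n_samples,). Values are invented.
training_x = np.array([[0.1, 0.2, -0.1],
                       [0.21, -0.32, 0.3],
                       [0.05, 0.4, -0.2],
                       [-0.3, 0.1, 0.25]])
training_y = np.array([0.3, 0.2, 0.1, -0.05])

# Same settings as the question's MLPClassifier, but with a regressor,
# which accepts continuous float targets. max_iter is an assumed value;
# on data this small the fit may still emit a ConvergenceWarning.
reg = MLPRegressor(solver='adam', alpha=1e-5, hidden_layer_sizes=(5, 3),
                   random_state=1, max_iter=500)
reg.fit(training_x, training_y)

# predict returns one float per input row.
predictions = reg.predict(training_x)
print(predictions.shape)
```

Plain Python lists like the ones in the question also work with fit; converting to NumPy arrays just makes it easy to check the shapes first.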

Upvotes: 3
