Reputation: 3679
I have a DataFrame with the columns year, month, day, hour, minute, second, and Daily_KWH_System. I need to predict the daily KWH using a neural network. How should I go about it?
Daily_KWH_System year month day hour minute second
0 4136.900384 2016 9 7 0 0 0
1 3061.657187 2016 9 8 0 0 0
2 4099.614033 2016 9 9 0 0 0
3 3922.490275 2016 9 10 0 0 0
4 3957.128982 2016 9 11 0 0 0
I get a ValueError when I fit the model.
Code so far:
X = df[['year','month','day','hour','minute','second']]
y = df['Daily_KWH_System']
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)
#y_train.shape
#X_train.shape
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
#y_train = np.asarray(df['Daily_KWH_System'], dtype="|S6")
mlp.fit(X_train,y_train)
Error:
ValueError: Unknown label type: (array([ 2.27016856e+02, 3.02173014e+03, 4.29404190e+03,
2.41273427e+02, 1.76714247e+02, 4.23374425e+03,
Upvotes: 11
Views: 8663
Reputation: 21
Use a regressor instead. The targets are continuous floats, which a classifier rejects as labels, so switching to a regressor resolves the error.
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(solver='lbfgs',alpha=0.001,hidden_layer_sizes=(10,10))
# Variable names match the question's train/test split
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Upvotes: 2
Reputation: 19634
First of all, this is a regression problem and not a classification problem, as the values in the Daily_KWH_System
column do not form a set of labels. Instead, they seem to be (at least based on the provided example) real numbers.
If you want to approach it as a classification problem regardless, then according to sklearn documentation:
When doing classification in scikit-learn, y is a vector of integers or strings.
In your case, y is a vector of floats, and therefore you get the error. Thus, instead of the line
y = df['Daily_KWH_System']
write the line
y = np.asarray(df['Daily_KWH_System'], dtype="|S6")
and this will resolve the issue. (You can read more about this approach here: Python RandomForest - Unknown label Error)
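To make the effect of that cast visible, here is a minimal sketch using the five values from the question as a plain NumPy array (the variable names kwh, labels and n_classes are only for illustration). The dtype "|S6" turns each float into a byte string of at most 6 characters, so each distinct truncated value ends up as its own class label.

```python
import numpy as np

# The five target values shown in the question.
kwh = np.array([4136.900384, 3061.657187, 4099.614033, 3922.490275, 3957.128982])

# Cast the floats to fixed-width byte strings of at most 6 characters.
labels = np.asarray(kwh, dtype="|S6")

# Every distinct truncated string is treated as a separate class.
n_classes = len(np.unique(labels))
print(labels, n_classes)
```

With targets that are almost all distinct, the number of classes is close to the number of rows, which is one way to see why classification is an awkward fit here.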
However, since regression is more appropriate in this case, instead of making the change above, replace the lines
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
with
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))
The code will run without throwing an error (but there certainly isn't enough data to check whether the model that we get performs well).
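The difference between the two estimators can be reproduced with a minimal sketch on made-up data. The random features and targets below are assumptions for illustration only, not the asker's data, and the max_iter and random_state values are arbitrary choices.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data: 40 rows, 3 features, continuous float targets.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(40, 3)))
y = rng.uniform(100.0, 5000.0, size=40)

# A classifier rejects continuous float targets with a ValueError.
try:
    MLPClassifier(hidden_layer_sizes=(30, 30, 30), max_iter=50).fit(X, y)
    classifier_error = None
except ValueError as exc:
    classifier_error = str(exc)

# A regressor accepts the same targets.
reg = MLPRegressor(hidden_layer_sizes=(30, 30, 30), max_iter=200, random_state=0)
reg.fit(X, y)
predictions = reg.predict(X)

print("classifier error:", classifier_error)
print("regressor output shape:", predictions.shape)
```

The regressor may emit a convergence warning with so few iterations; that is a warning about fit quality, not an error.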
With that being said, I don't think that this is the right approach for choosing features for this problem.
In this problem we deal with a sequence of real numbers that form a time series. One reasonable feature is the number of seconds (or minutes, hours, days, etc.) elapsed since the starting point. Since this particular data contains only days, months and years (the other values are always 0), we could use the number of days elapsed since the first observation as a feature. Your data frame would then look like:
Daily_KWH_System days_passed
0 4136.900384 0
1 3061.657187 1
2 4099.614033 2
3 3922.490275 3
4 3957.128982 4
You could take the values in the column days_passed as features and the values in Daily_KWH_System as targets. You may also add some indicator features. For example, if you think that the end of the year may affect the target, you can add an indicator feature that indicates whether the month is December or not.
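One possible way to build these features with pandas is sketched below, using the five rows from the question. The column name is_december is a hypothetical choice for the indicator feature; days_passed follows the table above.

```python
import pandas as pd

# The five rows shown in the question (hour, minute, second are all 0 and omitted).
df = pd.DataFrame({
    "Daily_KWH_System": [4136.900384, 3061.657187, 4099.614033, 3922.490275, 3957.128982],
    "year": [2016] * 5,
    "month": [9] * 5,
    "day": [7, 8, 9, 10, 11],
})

# pd.to_datetime can assemble timestamps from columns named year, month, day.
dates = pd.to_datetime(df[["year", "month", "day"]])

# Whole days elapsed since the first observation.
df["days_passed"] = (dates - dates.min()).dt.days

# Hypothetical indicator feature: 1 if the month is December, else 0.
df["is_december"] = (df["month"] == 12).astype(int)

X = df[["days_passed", "is_december"]]
y = df["Daily_KWH_System"]
print(df[["Daily_KWH_System", "days_passed", "is_december"]])
```

X and y can then be passed through the same split, scaling and MLPRegressor steps as before.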
If the data is indeed daily (at least in this example you have one data point per day) and you want to tackle this problem with neural networks, then another reasonable approach would be to treat it as a time series and fit a recurrent neural network. Here is a great blog post that describes this approach:
http://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
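Before a recurrent network can be fitted, the series is usually cut into fixed-length windows of past values, each paired with the value that follows. The sketch below shows only that preparation step with NumPy; the helper make_windows and the window length of 3 are assumptions for illustration, and no network is built here.

```python
import numpy as np

def make_windows(series, window):
    """Hypothetical helper: split a 1D series into (past window, next value) pairs."""
    series = np.asarray(series, dtype=float)
    # Row i holds the values at positions i .. i + window - 1.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    # The target for row i is the value right after its window.
    y = series[window:]
    return X, y

kwh = [4136.900384, 3061.657187, 4099.614033, 3922.490275, 3957.128982]
X, y = make_windows(kwh, window=3)

# Recurrent layers commonly expect input shaped (samples, timesteps, features).
X_rnn = X.reshape(X.shape[0], X.shape[1], 1)
print(X_rnn.shape, y)
```

With only five observations this yields two training pairs, which again shows that far more data is needed before any model can be evaluated.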
Upvotes: 14
Reputation: 106
Instead of
mlp.fit(X_train,y_train)
use this
mlp.fit(X_train,y_train.values)
Upvotes: -1
Reputation: 5034
The fit() function expects y to be a 1D array-like. Selecting a single column with single brackets, df['col'], already returns a 1D Series, whereas selecting with a list of columns, df[['col']], returns a 2D DataFrame. If you end up with a 2D object, convert it into an actual 1D list, as expected by the fit function:
y = list(df['Daily_KWH_System'])
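A quick way to check the dimensionality of a given selection is its ndim attribute. The sketch below uses three made-up rows under the question's column name; the variable names series_ndim and frame_ndim are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Daily_KWH_System": [4136.900384, 3061.657187, 4099.614033]})

# Single brackets select one column as a Series.
series_ndim = df["Daily_KWH_System"].ndim

# A list of column names selects a DataFrame, even for one column.
frame_ndim = df[["Daily_KWH_System"]].ndim

# A 1D NumPy array of the targets, suitable as y for fit().
y = df["Daily_KWH_System"].to_numpy()
print(series_ndim, frame_ndim, y.shape)
```

Note that converting the container does not change the values themselves: float targets stay continuous, so a classifier will still reject them.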
Upvotes: 2