Reputation: 21
I am completely new to machine learning (I began the day before yesterday) and I have written a Python script that I hope will give me a prediction of a stock price (or at least an estimate). So far I have gathered the data, log-transformed the values, normalized them, and put them in a dataframe. The code is below:
from scipy import stats
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from keras.models import Sequential
import time
df = pd.read_csv('Companies\ADANIPORTS.NS\swing trading\ADANIPORTS.NS.csv')
# convert everything to logarithmic values first to apply central limit theorem. Read about it.
open_log = np.log(df['Open'])
high_log = np.log(df['High'])
low_log = np.log(df['Low'])
close_log = np.log(df['Close'])
df = pd.DataFrame({'Open': open_log,'High': high_log,'Low': low_log,'Close': close_log})
scaler = MinMaxScaler()
scaler.fit(df)
NewData = scaler.transform(df)
pd.set_option('display.max_rows', None)
newdf = pd.DataFrame(NewData,columns=['Open','High','Low','Close'])
newdf.to_csv('logout.csv', index=False)
#X_train, y_train, X_test, y_test = train_test_split(newdf, test_size=0.3, shuffle=False)
train, test = train_test_split(newdf, test_size=0.3, shuffle=False)
print(train)
model = Sequential()
input_layer = Dense(32, input_shape=(4,))
model.add(input_layer)
hidden_layer = Dense(64, activation='relu')
model.add(hidden_layer)
output_layer = Dense(4)
model.add(output_layer)
model.compile(loss='mse', optimizer='rmsprop', metrics = ['accuracy'])
model.fit(train,test,epochs=10, verbose=0)
model.fit(X_train, y_train, epochs=10, validation_split=0.05)
'''
model = Sequential()
model.add(LSTM(units = 50,input_dim = 4))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(output_dim = 1))
model.add(Activation('relu'))
start = time.time()
model.compile(loss='mse', optimizer='rmsprop')
print('compile time', time.time()-start)
model.fit(X_train, y_train, batch_size=512, nb_epoch=1, validation_split=0.05)
predictions = lstm.predict_sequences_multiple(model,X_test,50,50)
lstm.plot_results_multile(predictions,y_test,50)
'''
But every time I run the code with model.fit(train,test,epochs=10, verbose=0)
I get this error:
ValueError: Data cardinality is ambiguous:
x sizes: 1875
y sizes: 804
Please provide data which shares the same first dimension.
and if I run it with model.fit(X_train, y_train, epochs=10, validation_split=0.05)
I get this error:
X_train, y_train, X_test, y_test = train_test_split(newdf, test_size=0.3, shuffle=False)
ValueError: not enough values to unpack (expected 4, got 2)
There seem to be answers on Stack Overflow for both errors, but I can't make them work because of my limited knowledge of ML. So my question is: how do I fit the preprocessed data to the model?
The dataframe looks something like
Open High Low Close
0 0.019199 0.013422 0.037204 0.021447
1 0.025233 0.039041 0.044162 0.045250
2 0.048863 0.070543 0.052112 0.079218
3 0.082475 0.077543 0.088086 0.070864
4 0.070315 0.068797 0.085953 0.070041
5 0.077322 0.098920 0.091625 0.093531
6 0.099061 0.106808 0.112896 0.103979
7 0.091415 0.120864 0.000000 0.130006
8 0.137847 0.129369 0.135259 0.118405
and so on until row 2678. Fairly straightforward, I suppose.
Help me. Thanks.
Upvotes: 1
Views: 4030
Reputation: 19322
First part -
model.fit(train,test,epochs=10, verbose=0)
This call cannot work. model.fit needs the samples (the rows of your x data) and one label per sample (the elements of your y data). If x has 100 rows, then y must also have 100 labels. Here x is train (1875 rows) and y is test (804 rows), which is exactly the mismatch the error reports. Passing test as the labels doesn't make sense in any case, since that data is held out only to check how well your model generalizes.
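To see where the two numbers in the error come from, here is a small sketch with a placeholder array that has the same number of rows as the question's data (2679, assuming the rows run from 0 to 2678). The array contents are made up; only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 2679 rows x 4 columns, standing in for newdf.
data = np.zeros((2679, 4))

# With a single array, train_test_split returns exactly two pieces:
# the first 70% of the rows and the last 30% (no shuffling).
train, test = train_test_split(data, test_size=0.3, shuffle=False)

print(train.shape)  # (1875, 4)
print(test.shape)   # (804, 4)
```

Both pieces hold features only, so neither can serve as the labels for the other.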
Secondly -
It isn't clear what your y variable is. You have to define X (the features) and y (the target) yourself first, and then pass both to train_test_split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
This is the syntax for train_test_split. Your X is a 2-D matrix that contains your independent variables, and your y is a 1-D array (specific to your problem).
Only once you do that will your model be able to train properly.
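As a minimal sketch of that split, suppose the goal is to predict the next row's Close from the current row's Open, High, Low and Close. The question does not say what the target is, so this choice is an assumption, and the small random DataFrame is only a placeholder for newdf.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder for the scaled DataFrame from the question (random values).
rng = np.random.default_rng(0)
newdf = pd.DataFrame(rng.random((100, 4)), columns=['Open', 'High', 'Low', 'Close'])

# Assumed target: the Close of the following row.
y = newdf['Close'].shift(-1)

# The last row has no "next" Close, so drop it from both X and y.
X = newdf.iloc[:-1].to_numpy()
y = y.iloc[:-1].to_numpy()

# shuffle=False keeps the time order; note the order of the four outputs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

X_train and y_train now have the same number of rows, which is what model.fit checks for.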
Lastly -
Your model architecture has a 4-dimensional input and a 4-dimensional output. Are you trying to predict 4 numeric values? If not, your output layer should be a single Dense(1).
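If a single value is the target, a sketch of the network with one output unit might look like the following. The random arrays are placeholders for X_train and y_train, the hidden layer sizes are carried over from the question, and 'mae' is used as the metric instead of 'accuracy' because accuracy is a classification metric. Nothing here is tuned.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Input

# Placeholder training data: 200 samples, 4 features, one target each.
rng = np.random.default_rng(0)
X_train = rng.random((200, 4)).astype('float32')
y_train = rng.random(200).astype('float32')

model = Sequential([
    Input(shape=(4,)),             # four inputs: Open, High, Low, Close
    Dense(32),
    Dense(64, activation='relu'),
    Dense(1),                      # one output value per sample
])
model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])

# x and y have the same number of rows, so fit() accepts them.
model.fit(X_train, y_train, epochs=2, validation_split=0.05, verbose=0)

preds = model.predict(X_train[:5], verbose=0)
print(preds.shape)  # (5, 1)
```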
I would really encourage you not to play around with code like this and to spend some time with a few Keras tutorials first, or you will end up picking up bad coding habits.
Upvotes: 1
Reputation: 529
The train_test_split function returns only 2 values, not 4, the way you are calling it. You can use it like this:
train, test = train_test_split(newdf, test_size=0.3, shuffle=False)
Or you should also pass the labels to train_test_split as a parameter. I am not sure which column is your label column. Note that the four outputs come back in the order X_train, X_test, y_train, y_test:
X_train, X_test, y_train, y_test = train_test_split(newdf, labels, test_size=0.3, shuffle=False)
Upvotes: 0