Reputation: 171
The following has been completed using MATLAB.
I am trying to build a trading algorithm using Deep Q-learning. I have just taken a year's worth of daily stock prices and am using that as the training set.
My state space is [money, stock, price], where money is the amount of cash I have, stock is the number of stocks I have, and price is the price of the stock at that time step.
The issue I am having is with the actions; looking online, people only have three actions, { buy | sell | hold }.
My reward function is the difference between the portfolio value at the current time step and the portfolio value at the previous time step.
But using just three actions, I am unsure how I would choose to buy, let's say, 67 shares at the current price?
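For concreteness, here is how that reward works out on made-up numbers (the 67-share purchase and the prices below are purely illustrative, not taken from my data):

% hypothetical starting point: 1000 cash, 0 shares, price 10.00
pt_prev = 1000 + 0 * 10.00;      % portfolio value before the trade  = 1000.00
% buy 67 shares at 10.00, then the price moves to 10.05
money   = 1000 - 67 * 10.00;     % cash left after the buy           =  330.00
pt_now  = money + 67 * 10.05;    % portfolio value at the next step  = 1003.35
reward  = pt_now - pt_prev;      % reward                            =    3.35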
I am using a neural network to approximate the Q-values. It has three inputs, [money, stock, price], and 202 outputs, i.e. I can sell between 0 and 100 shares, hold, or buy between 1 and 100 shares.
Can anyone shed some light on how I can reduce this to 3 actions?
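For reference, the mapping I use in the code below from the network's output index to a signed trade size is a2 = a - 101, so for example:

% output index a (1..202)  ->  trade size a2 = a - 101
a = 34;    a2 = a - 101;    % a2 = -67  -> sell 67 shares
a = 101;   a2 = a - 101;    % a2 =   0  -> hold
a = 168;   a2 = a - 101;    % a2 = +67  -> buy 67 shares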
My code is:
% p is the stock price
% sp is the stock price at the next time interval
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidden_layers = 1;   % note: newff uses this as the hidden-layer size (1 neuron), not a layer count
actions = 202;       % discrete trade sizes, mapped to signed share quantities via a - 101 below
net = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...
             [hidden_layers, actions], ...
             {'tansig','purelin'}, ...
             'trainlm' );
net = init( net );
net.trainParam.showWindow = false;
% neural network training parameters -----------------------------------
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.1;
net.trainParam.epochs = 100;
% parameters for q learning --------------------------------------------
epsilon = 0.8;
gamma = 0.95;
max_episodes = 1000;
max_iterations = length( p ) - 1;
reset = false;
initial_money = 1000;
initial_stock = 0;
%These will be where I save the outputs
save_s = zeros( max_iterations, max_episodes );
save_pt = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a = zeros( max_iterations, max_episodes );
% construct the initial state -------------------------------------------
% a = randi( [1 3], 1, 1 );
s = [initial_money; initial_stock; p( 1, 1 )];
% construct initial q matrix -------------------------------------------
Qs = zeros( 1, actions );
Qs_prime = zeros( 1, actions );
for i = 1:max_episodes
    for j = 1:max_iterations
        % forward pass: current Q-value estimates for all discrete actions
        Qs = net( s );
        % epsilon-greedy: with probability epsilon take the current greedy
        % action, otherwise pick a random action index
        if ( rand() <= epsilon )
            [Qs_value, a] = max( Qs );
        else
            a = randi( [1 actions], 1, 1 );
        end
        a2 = a - 101;                        % map index 1..202 to a signed trade size (negative = sell)
        save_a(j,i) = a2;
        sp = p( j+1, 1 );                    % stock price at the next time step
        pt = s( 1 ) + s( 2 ) * p( j, 1 );    % current portfolio value (cash + holdings)
        save_pt(j,i) = pt;
        [s_prime, reward] = simulateStock( s, a2, pt, sp );
        % one-step Q-learning target: r + gamma * max_a' Q(s', a')
        Qs_prime = net( s_prime );
        Q_target = reward + gamma * max( Qs_prime );
        save_Q_target(j,i) = Q_target;
        % only the chosen action's output is moved toward the target
        Targets = Qs;
        Targets( a ) = Q_target;
        net = train( net, s, Targets );      % fit the network on this (state, target) pair
        save_s( j, i ) = s( 1 );
        s = s_prime;
    end
    epsilon = epsilon * 0.99;                % decay epsilon after each episode
    reset = false;
    s = [initial_money; initial_stock; p(1,1)];   % reset to the initial state
end
% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
    money = s(1);
    stock = s(2);
    price = s(3);
    money = money - a * price;   % spend cash when buying (a > 0), receive cash when selling (a < 0)
    money = max( money, 0 );     % floor cash at zero
    stock = stock + a;           % adjust holdings by the traded quantity
    stock = max( stock, 0 );     % floor holdings at zero
    s_prime = [money; stock; sp];              % next state carries the next period's price
    reward = ( money + stock * price ) - pt;   % change in portfolio value vs. the previous step
end
Upvotes: 3
Views: 3358
Reputation: 1
You may be right that using just a { buy | hold | sell } action set is a frequent habit in academic papers, where authors sometimes decide to illustrate their academic efforts at improving learning / statistical methods and opt to pick an exemplary application in a trading domain. The pity is, this may be done in academic papers, but not in the reality of trading.
Even with an elementary view of trading, the problem is much more complex. As a brief reference, there are more than five principal domains of such a model-space. Given that trading is to be modelled, one cannot do without a fully described strategy --
Tru-Strategy := { SelectPOLICY,
DetectPOLICY,
ActPOLICY,
AllocatePOLICY,
TerminatePOLICY
}
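Purely as an illustration of the shape such a strategy takes (the policy names are the ones above, while the bodies are placeholder stubs invented here, not real trading logic), one could hold it in a MATLAB struct of function handles:

% placeholder sketch only -- each policy is a stub function handle
strategy = struct( ...
    'SelectPOLICY',    @( universe )      universe( 1 ),                  ... % which instrument(s) to trade at all
    'DetectPOLICY',    @( marketState )   false,                          ... % when an entry opportunity exists
    'ActPOLICY',       @( signal )        'hold',                         ... % how to act on a detected signal
    'AllocatePOLICY',  @( equity, price ) floor( 0.1 * equity / price ),  ... % how much capital / how many units to commit
    'TerminatePOLICY', @( position )      false                           ... % when to exit or stop the trade
    );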
Any simplification, however motivated, that opts to omit any single one of these five principal domains becomes anything but a true Trading Strategy.
One can easily figure out what comes out of just training (and, worse, later harnessing such a model for doing real trades) an ill-defined model that is not coherent with the reality.
Sure, it can (and, unless the minimiser's criterion function is itself ill-formulated, will) reach some mathematical function's minimum, but that does not make the reality immediately change its so-far natural behaviours and start to "obey" the ill-defined model and "dance" according to such oversimplified or otherwise skewed (ill-modelled) opinions about the reality.
If in doubt about what this means, try to follow an example:
Today, the Strategy-Model decides to A:Buy(AAPL,67).
Tomorrow, AAPL goes down by some 0.1%, and thus the immediate reward (as proposed above) is negative, punishing that decision. The Model is stimulated not to do it again (do not buy AAPL).
The point is that after some period of time AAPL rises much higher, producing a much higher reward than the initial fluctuations in the day-to-day Close, which is known, but which the proposed Strategy-Model Q-function simply, and in principle erroneously, did not reflect at all.
This means the as-is Model could be trained to act according to the stimuli so defined, but its actual behaviour will favour NOTHING but such extremely naive intraday "quasi-scalping" shots, with limited (if any) support from the actual Market State & Market Dynamics, as are available in many industry-wide accepted quantitative models.
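A trivial numeric sketch of that trap (the prices below are invented, purely to mirror the story above):

% made-up daily Close prices, only to mirror the story above
Close = [ 170.00, 169.83, 171.20, 174.90, 178.30 ];    % day 1 .. day 5
qty   = 67;                                            % the Buy( AAPL, 67 ) decision on day 1

r_next_day = qty * ( Close(2)   - Close(1) );   % ~ -11.4  -> the one-step reward punishes the buy
r_horizon  = qty * ( Close(end) - Close(1) );   % ~ +556.1 -> the same decision, judged over the horizon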
So, sure, one can train a reality-blind model, kept blind & deaf (ignoring the reality of the Problem Domain), but for what sake?
Epilogue:
There is no such thing as "Data Science", even when MarCom & HR beat their drums & whistles, as they indeed do a lot nowadays.
Why?
Exactly because of the rationale observed above. Having data-points is nothing by itself. Sure, it is better than standing clueless in front of the customer without a single observation of reality, but the data-points alone do not save the game.
It is the domain knowledge that starts to make some sense of the data-points, not the data-points per se.
If still in doubt: if one has a few terabytes of numbers, there is no Data Science to tell you what the data-points represent. On the other hand, if one knows from the domain-specific context that these data-points ought to be temperature readings, there is still no Data-Science god to tell you whether they are all (just by coincidence) in [K] or in [°C] (if they are all positive readings >= 0.00001).
Upvotes: 3