usman Farooq

Reputation: 171

Trading algorithm - actions in Q-learning/DQN

The following was done using MATLAB.

I am trying to build a trading algorithm using Deep Q-learning. I have just taken a year's worth of daily stock prices and am using that as the training set.

My state space is [money, stock, price], where:
money is the amount of cash I have,
stock is the number of shares I hold, and
price is the price of the stock at that time step.

The issue I am having is with the actions; looking online, people only have three actions, { buy | sell | hold }.

My reward function is the difference between the portfolio value at the current time step and at the previous time step.
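To make that concrete, here is a tiny toy calculation of that reward (the numbers and variable names are made up for illustration only):

money_prev = 1000;   stock_prev = 0;    price_prev = 10.0;   % before buying
money_now  =  330;   stock_now  = 67;   price_now  = 10.2;   % after buying 67 shares at 10.0

portfolio_prev = money_prev + stock_prev * price_prev;       % = 1000
portfolio_now  = money_now  + stock_now  * price_now;        % = 1013.4
reward         = portfolio_now - portfolio_prev;             % = 13.4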

But using just three actions, I am unsure how I would choose to buy, let's say, 67 shares at the current price.

I am using a neural network to approximate the Q-values. It has three inputs, [money, stock, price], and 202 outputs, i.e. I can sell between 0 and 100 shares, hold (0), or buy between 1 and 100 shares.
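For clarity, the way I map an output index to a trade size in my code below is (a = 150 is just an example index):

a  = 150;       % index of the output with the highest Q-value, a is in 1..202
a2 = a - 101;   % trade size in shares: a2 < 0 sells, a2 == 0 holds, a2 > 0 buys; here a2 = 49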

Can anyone shed some light on how I can reduce this to 3 actions?

My code is:

%  p is the stock price
% sp is the stock price at the next time interval 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

hidden_layers =   1;
actions       = 202;
net           = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...
                       [hidden_layers, actions],                     ...
                       {'tansig','purelin'},                         ...
                       'trainlm' );

net           = init( net );

net.trainParam.showWindow = false;

% neural network training parameters -----------------------------------
net.trainParam.lr     =   0.01;
net.trainParam.mc     =   0.1;
net.trainParam.epochs = 100;

% parameters for q learning --------------------------------------------
epsilon        =    0.8;
gamma          =    0.95;
max_episodes   = 1000;
max_iterations = length( p ) - 1;

reset          =    false;
inital_money   = 1000;
inital_stock   =    0;

%These will be where I save the outputs
save_s        = zeros( max_iterations, max_episodes );
save_pt       = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a        = zeros( max_iterations, max_episodes );

% construct the inital state -------------------------------------------
% a           = randi( [1 3], 1, 1 );  
s             = [inital_money;inital_stock;p( 1, 1 )];


% construct initial q matrix -------------------------------------------
Qs            = zeros( 1, actions );
Qs_prime      = zeros( 1, actions );


for     i = 1:max_episodes
    for j = 1:max_iterations             % max_iterations --------------

        Qs = net( s );

        %% choose an action: greedy with probability epsilon, otherwise a random exploratory action

        if ( rand() <= epsilon )
            [Qs_value, a] = max( Qs );
        else
            a = randi( [1 202], 1, 1 );
        end

        a2                 = a - 101;                        % map output index 1..202 to a trade of -100..101 shares
        save_a(j,i)        = a2;
        sp                 = p( j+1, 1 );                    % stock price at the next time step
        pt                 = s( 1 ) + s( 2 ) * p( j, 1 );    % portfolio value at the current price
        save_pt(j,i)       = pt;
        [s_prime,reward]   = simulateStock( s, a2, pt, sp );

        Qs_prime           = net( s_prime );

        Q_target           = reward + gamma * max( Qs_prime );   % one-step TD target
        save_Q_target(j,i) = Q_target;
        Targets            = Qs;

        Targets( a )       = Q_target;

        % fit the network to the TD target for the chosen action
        net                = train( net, s, Targets );

        save_s( j, i )     = s( 1 );
        s                  = s_prime;
    end

    epsilon = epsilon * 0.99 ; 
    reset   = false; 
    s       = [inital_money;inital_stock;p(1,1)];
end

% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
    money   = s( 1 );
    stock   = s( 2 );
    price   = s( 3 );

    money   = money - a * price;          % buying ( a > 0 ) spends cash, selling ( a < 0 ) receives cash
    money   = max( money, 0 );            % do not allow negative cash
    stock   = s( 2 ) + a;                 % update the share count
    stock   = max( stock, 0 );            % do not allow a negative share count

    s_prime = [money; stock; sp];                 % next state carries the next price
    reward  = ( money + stock * price ) - pt;     % change in portfolio value
end

Upvotes: 3

Views: 3358

Answers (1)

user3666197

Reputation: 1

Actions: ill-defined
( unless there is an ultimate reason for such a flattened, decaffeinated & knowingly short-cut model )

You may be right that using a range of just { buy | hold | sell } actions is a frequent habit in academic papers, where authors sometimes decide to illustrate their efforts at improving learning / statistical methods and pick an exemplary application in the trading domain. The pity is that this may pass in academic papers, but not in the reality of trading.

Why?

Even with an elementary view of trading, the problem is much more complex. As a brief reference, there are at least five principal domains of such a model-space. If trading is to be modelled, one cannot do without a fully described strategy --

Tru-Strategy := {    SelectPOLICY,
                     DetectPOLICY,
                        ActPOLICY,
                   AllocatePOLICY,
                  TerminatePOLICY
                  }
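Just to illustrate the shape of such a decomposition ( the handles below are empty placeholders, not a recommendation of any concrete policy ), the five domains could be sketched as a struct of function handles:

strategy = struct( 'SelectPOLICY',    @( universe )       universe( 1 ),     ... % which instrument(s) to trade
                   'DetectPOLICY',    @( state )          false,             ... % is there a tradeable setup right now?
                   'ActPOLICY',       @( state )          'hold',            ... % what order to send ( side / type / price )
                   'AllocatePOLICY',  @( state, equity )  0.01 * equity,     ... % how much capital / how many shares to commit
                   'TerminatePOLICY', @( position )       false              ... % when to exit / stop-loss / take-profit
                 );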

Any simplification, however motivated, that omits even a single one of these five principal domains is anything but a true Trading Strategy.

One can easily figure out what comes out of merely training ( and, worse, later harnessing for real trades ) an ill-defined model that is not coherent with reality.

Sure, it can reach ( and will, unless the minimiser's criterion function is itself ill-formulated ) some mathematical function's minimum, but that does not make reality suddenly change its so-far natural behaviour, start to "obey" the ill-defined model, and "dance" according to such oversimplified or otherwise skewed ( ill-modelled ) opinions about reality.


Rewards: ill-defined
( unless there is a reason for ignoring the fact of delayed rewards )

If in doubt about what this means, try to follow an example:
Today, the Strategy-Model decides to A:Buy(AAPL,67).
Tomorrow, AAPL goes down some 0.1%, and thus the immediate reward ( as proposed above ) is negative, punishing that decision. The Model is stimulated not to repeat it ( do not buy AAPL ).

The point is that after some period of time AAPL rises much higher, producing a much higher reward than the initial fluctuations in the day-to-day Close, which is known, but which the proposed Strategy-Model's Q-function, on principle, erroneously does not reflect at all.
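A toy numeric sketch of the difference ( the closing prices below are made up, gamma is taken from the question's code ):

closes      = [ 100.0, 99.9, 99.8, 101.5, 104.0, 108.0 ];    % day 0 .. day 5 after the Buy(AAPL,67)
shares      = 67;
gamma       = 0.95;

r_next_day  = shares * ( closes( 2 ) - closes( 1 ) );                      % immediate reward = -6.7
daily_pnl   = shares * diff( closes );                                     % per-day changes in position value
G_delayed   = sum( gamma .^ ( 0:numel( daily_pnl ) - 1 ) .* daily_pnl );   % discounted return ~ +452

The single-step reward sees only the first -6.7 and punishes the decision; the bulk of the move arrives days later and is never credited to that decision under the proposed reward design.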

Beware WYTIWYG -- What You Train Is What You Get ...

This means the as-is Model could be trained to act according to such stimuli, but its actual behaviour will favour NOTHING but extremely naive intraday "quasi-scalping" shots, with limited ( if any ) support from the actual Market State & Market Dynamics that are available in many industry-wide accepted quantitative models.

So, sure, one can train a reality-blind model that was kept blind & deaf ( ignoring the reality of the Problem Domain ), but to what end?


Epilogue:

There is no such thing as "Data Science",
even though MarCom & HR beat their drums & whistles over it, as they indeed do a lot nowadays.


Why?

Exactly because of the rationale observed above. Having data-points is nothing by itself. Sure, it is better than standing clueless in front of the customer without a single observation of reality, but data-points alone do not save the game.

It is the domain knowledge that starts to make sense of the Data-points, not the Data-points per se.

If still in doubt: if one has a few terabytes of numbers, there is no Data Science to tell you what those data-points represent.

On the other hand, if one knows from the domain-specific context that these data-points ought to be temperature readings, there is still no Data-Science God to tell you whether they are all ( just by coincidence ) in [°K] or in [°C] ( if they are all positive readings >= 0.00001 ).

Upvotes: 3
