usman Farooq

Reputation: 171

Trading algorithm - actions in Q-learning/DQN

The following was done using MATLAB.

I am trying to build a trading algorithm using Deep Q-learning. I have just taken a year's worth of daily stock prices and am using that as the training set.

My state space is [money, stock, price], where:
money is the amount of cash I have,
stock is the number of shares I hold, and
price is the price of the stock at that time step.

The issue I am having is with the actions; looking online, people only have three actions, { buy | sell | hold }.

My reward function is the difference between the portfolio value at the current time step and at the previous time step.
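To make that concrete, here is a tiny toy calculation of that reward (the numbers and variable names are made up for illustration only):

money_prev = 1000;   stock_prev = 0;    price_prev = 10.0;   % before buying
money_now  =  330;   stock_now  = 67;   price_now  = 10.2;   % after buying 67 shares at 10.0

portfolio_prev = money_prev + stock_prev * price_prev;       % = 1000
portfolio_now  = money_now  + stock_now  * price_now;        % = 1013.4
reward         = portfolio_now - portfolio_prev;             % = 13.4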

But using just three actions, I am unsure how I would choose to buy, let's say, 67 shares at the current price.

I am using a neural network to approximate the Q-values. It has three inputs, [money, stock, price], and 202 outputs, i.e. I can sell between 0 and 100 shares, hold (0), or buy between 1 and 100 shares.
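For clarity, the way I map an output index to a trade size in my code below is (a = 150 is just an example index):

a  = 150;       % index of the output with the highest Q-value, a is in 1..202
a2 = a - 101;   % trade size in shares: a2 < 0 sells, a2 == 0 holds, a2 > 0 buys; here a2 = 49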

Can anyone shed some light on how I can reduce this to 3 actions?

My code is:

%  p is the stock price
% sp is the stock price at the next time interval 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

hidden_layers =   1;
actions       = 202;
net           = newff( [-1000000 1000000; -1000000 1000000; 0 1000], ...
                       [hidden_layers, actions],                     ...
                       {'tansig','purelin'},                         ...
                       'trainlm' );

net           = init( net );

net.trainParam.showWindow = false;

% neural network training parameters -----------------------------------
net.trainParam.lr     =   0.01;
net.trainParam.mc     =   0.1;
net.trainParam.epochs = 100;

% parameters for q learning --------------------------------------------
epsilon        =    0.8;
gamma          =    0.95;
max_episodes   = 1000;
max_iterations = length( p ) - 1;

reset          =    false;
inital_money   = 1000;
inital_stock   =    0;

%These will be where I save the outputs
save_s        = zeros( max_iterations, max_episodes );
save_pt       = zeros( max_iterations, max_episodes );
save_Q_target = zeros( max_iterations, max_episodes );
save_a        = zeros( max_iterations, max_episodes );

% construct the inital state -------------------------------------------
% a           = randi( [1 3], 1, 1 );  
s             = [inital_money;inital_stock;p( 1, 1 )];


% construct initial q matrix -------------------------------------------
Qs            = zeros( 1, actions );
Qs_prime      = zeros( 1, actions );


for     i = 1:max_episodes
    for j = 1:max_iterations             % max_iterations --------------

        Qs = net( s );

        %% choose an action: greedy with probability epsilon, otherwise a random exploratory action

        if ( rand() <= epsilon )
            [Qs_value, a] = max( Qs );
        else
            a = randi( [1 202], 1, 1 );
        end

        a2                 = a - 101;                        % map output index 1..202 to a trade of -100..101 shares
        save_a(j,i)        = a2;
        sp                 = p( j+1, 1 );                    % stock price at the next time step
        pt                 = s( 1 ) + s( 2 ) * p( j, 1 );    % portfolio value at the current price
        save_pt(j,i)       = pt;
        [s_prime,reward]   = simulateStock( s, a2, pt, sp );

        Qs_prime           = net( s_prime );

        Q_target           = reward + gamma * max( Qs_prime );   % one-step TD target
        save_Q_target(j,i) = Q_target;
        Targets            = Qs;

        Targets( a )       = Q_target;

        % fit the network to the TD target for the chosen action
        net                = train( net, s, Targets );

        save_s( j, i )     = s( 1 );
        s                  = s_prime;
    end

    epsilon = epsilon * 0.99 ; 
    reset   = false; 
    s       = [inital_money;inital_stock;p(1,1)];
end

% ----------------------------------------------------------------------
function [s_prime, reward] = simulateStock( s, a, pt, sp )
    money   = s( 1 );
    stock   = s( 2 );
    price   = s( 3 );

    money   = money - a * price;          % buying ( a > 0 ) spends cash, selling ( a < 0 ) receives cash
    money   = max( money, 0 );            % do not allow negative cash
    stock   = s( 2 ) + a;                 % update the share count
    stock   = max( stock, 0 );            % do not allow a negative share count

    s_prime = [money; stock; sp];                 % next state carries the next price
    reward  = ( money + stock * price ) - pt;     % change in portfolio value
end

Upvotes: 3

Views: 3358

Answers (1)

user3666197

Reputation: 1

Actions: ill-defined
( unless there is an ultimate reason for such a flattened, decaffeinated & knowingly short-cut model )

You may be right that using a range of just { buy | hold | sell } actions is a frequent habit in academic papers, where authors sometimes decide to illustrate their efforts at improving learning / statistical methods and pick an exemplary application in the trading domain. The pity is that this may pass in academic papers, but not in the reality of trading.

Why?

Even with an elementary view of trading, the problem is much more complex. As a brief reference, there are at least five principal domains of such a model-space. If trading is to be modelled, one cannot do without a fully described strategy --

Tru-Strategy := {    SelectPOLICY,
                     DetectPOLICY,
                        ActPOLICY,
                   AllocatePOLICY,
                  TerminatePOLICY
                  }
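Just to illustrate the shape of such a decomposition ( the handles below are empty placeholders, not a recommendation of any concrete policy ), the five domains could be sketched as a struct of function handles:

strategy = struct( 'SelectPOLICY',    @( universe )       universe( 1 ),     ... % which instrument(s) to trade
                   'DetectPOLICY',    @( state )          false,             ... % is there a tradeable setup right now?
                   'ActPOLICY',       @( state )          'hold',            ... % what order to send ( side / type / price )
                   'AllocatePOLICY',  @( state, equity )  0.01 * equity,     ... % how much capital / how many shares to commit
                   'TerminatePOLICY', @( position )       false              ... % when to exit / stop-loss / take-profit
                 );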

Any simplification, however motivated, that omits even a single one of these five principal domains is anything but a true Trading Strategy.

One can easily figure out what comes out of merely training ( and, worse, later harnessing for real trades ) an ill-defined model that is not coherent with reality.

Sure, it can reach ( and will, unless the minimiser's criterion function is itself ill-formulated ) some mathematical function's minimum, but that does not make reality suddenly change its so-far natural behaviour, start to "obey" the ill-defined model, and "dance" according to such oversimplified or otherwise skewed ( ill-modelled ) opinions about reality.


Rewards: ill-defined
( unless there is a reason for ignoring the fact of delayed rewards )

If in doubt about what this means, try to follow an example:
Today, the Strategy-Model decides to A:Buy(AAPL,67).
Tomorrow, AAPL goes down some 0.1%, and thus the immediate reward ( as proposed above ) is negative, punishing that decision. The Model is stimulated not to repeat it ( do not buy AAPL ).

The point is that after some period of time AAPL rises much higher, producing a much higher reward than the initial fluctuations in the day-to-day Close, which is known, but which the proposed Strategy-Model's Q-function, on principle, erroneously does not reflect at all.
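A toy numeric sketch of the difference ( the closing prices below are made up, gamma is taken from the question's code ):

closes      = [ 100.0, 99.9, 99.8, 101.5, 104.0, 108.0 ];    % day 0 .. day 5 after the Buy(AAPL,67)
shares      = 67;
gamma       = 0.95;

r_next_day  = shares * ( closes( 2 ) - closes( 1 ) );                      % immediate reward = -6.7
daily_pnl   = shares * diff( closes );                                     % per-day changes in position value
G_delayed   = sum( gamma .^ ( 0:numel( daily_pnl ) - 1 ) .* daily_pnl );   % discounted return ~ +452

The single-step reward sees only the first -6.7 and punishes the decision; the bulk of the move arrives days later and is never credited to that decision under the proposed reward design.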

Beware WYTIWYG -- What You Train Is What You Get ...

This means the as-is Model could be trained to act according to such stimuli, but its actual behaviour will favour NOTHING but extremely naive intraday "quasi-scalping" shots, with limited ( if any ) support from the actual Market State & Market Dynamics that are available in many industry-wide accepted quantitative models.

So, sure, one can train a reality-blind model that was kept blind & deaf ( ignoring the reality of the Problem Domain ), but to what end?


Epilogue:

There is no such thing as "Data Science",
even though MarCom & HR beat their drums & whistles over it, as they indeed do a lot nowadays.


Why?

Exactly because of the rationale observed above. Having data-points is nothing by itself. Sure, it is better than standing clueless in front of the customer without a single observation of reality, but data-points alone do not save the game.

It is the domain knowledge that starts to make sense of the Data-points, not the Data-points per se.

If still in doubt: if one has a few terabytes of numbers, there is no Data Science to tell you what those data-points represent.

On the other hand, if one knows from the domain-specific context that these data-points ought to be temperature readings, there is still no Data-Science God to tell you whether they are all ( just by coincidence ) in [°K] or in [°C] ( if they are all positive readings >= 0.00001 ).

Upvotes: 3
