Reputation: 191
I am trying to write an adaptive controller for a control system, namely a power management system, using Q-learning. I recently implemented a toy RL problem for the cart-pole system and worked through the formulation of the helicopter control problem from Andrew Ng's notes. I appreciate how value function approximation becomes imperative in such situations. However, both of these popular examples have a very small number of possible discrete actions. I have three questions:
1) What is the correct way to handle such problems if you don't have a small number of discrete actions? The dimensionality of my actions and states seems to have blown up and the learning looks very poor, which brings me to my next question.
2) How do I measure the performance of my agent? Since the reward changes along with the dynamic environment, I can't settle on a performance metric for my continuous RL agent at every time step. Also, unlike in gridworld problems, I can't inspect the Q-value table because of the huge number of state-action pairs, so how do I know my actions are optimal?
3) I have a model for the evolution of the states through time: States = [Y, U], with Y[t+1] = aY[t] + bA, where A is an action. The discretization step I choose for the actions A will also affect how finely I have to discretize my state variable Y. How do I choose my discretization steps? I have put a rough sketch of what I mean below. Thanks a lot!
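Just to make the coupling concrete, here is a small sketch (the values of a, b and the ranges are made up):

```python
import numpy as np

# Hypothetical system parameters, just for illustration
a, b = 0.9, 0.5
action_step = 0.1                                   # chosen discretization step for A
actions = np.arange(-1.0, 1.0 + action_step, action_step)

# One action step moves Y[t+1] by roughly |b| * action_step, so the state grid
# should be at least that fine, or neighbouring actions map to the same state bin.
state_step = abs(b) * action_step
states = np.arange(-5.0, 5.0 + state_step, state_step)

print(len(actions), "actions x", len(states), "state bins =",
      len(actions) * len(states), "Q-table entries")
```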
Upvotes: -1
Views: 1166
Reputation: 5422
Have a look at policy search algorithms. Basically, they directly learn a parametric policy without an explicit value function, thus avoiding the problem of approximating the Q-function for continuous actions (e.g., no discretization of the action space is needed).
One of the easiest and earliest policy search algorithms is policy gradient. Have a look here for a quick survey about the topic, and here for a survey about policy search (there are more recent techniques by now, but that is a very good starting point). In the case of control problems, there is a very simple toy task you can look at, the Linear Quadratic Gaussian regulator (LQG). Here you can find a lecture covering this example, along with an introduction to policy search and policy gradient.
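To make the idea concrete, here is a minimal sketch of a REINFORCE-style policy gradient with a Gaussian policy on a 1-D linear system like the one in your question. The constants, the quadratic cost and the learning rate are assumptions for illustration, not a tuned implementation:

```python
import numpy as np

# Toy 1-D linear system in the spirit of LQG (all constants are made up)
a_sys, b_sys = 0.9, 0.5            # Y[t+1] = a_sys * Y[t] + b_sys * A + noise
gamma, sigma = 0.99, 0.3           # discount factor, fixed exploration std
theta, lr = 0.0, 1e-4              # linear policy: mean action = theta * y

def step(y, action):
    y_next = a_sys * y + b_sys * action + np.random.normal(0.0, 0.05)
    reward = -(y ** 2 + 0.1 * action ** 2)      # quadratic cost as negative reward
    return y_next, reward

for episode in range(2000):
    # roll out one episode with the stochastic (Gaussian) policy
    y, trajectory = np.random.uniform(-1, 1), []
    for t in range(50):
        action = np.random.normal(theta * y, sigma)
        y_next, r = step(y, action)
        trajectory.append((y, action, r))
        y = y_next

    # REINFORCE: grad log pi(a|y) = (a - theta*y) * y / sigma^2, weighted by the
    # discounted return from that time step (no baseline, so the estimate is noisy)
    G, grad = 0.0, 0.0
    for y_t, a_t, r_t in reversed(trajectory):
        G = r_t + gamma * G
        grad += (a_t - theta * y_t) * y_t / sigma ** 2 * G
    theta += lr * grad / len(trajectory)

print("learned feedback gain:", theta)
```

Note that the action is drawn from a Gaussian whose mean is linear in the state, so no discretization of the action space is involved; the only learned quantity is the feedback gain theta.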
Regarding your second point, if your environment is dynamic (that is, the reward function or the transition function, or both, change through time), then you need to look at non-stationary policies. That is typically a much more challenging problem in RL.
Upvotes: 2
Reputation: 4318
You may use a continuous-action reinforcement learning algorithm and completely avoid the discretization issue. I'd suggest you take a look at CACLA. As for performance, measure your agent's accumulated reward during an episode with learning turned off. Since your environment is stochastic, take many measurements and average them. A rough sketch of both ideas is below.
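Here is a sketch of a CACLA-style actor-critic on a 1-D toy system, plus an evaluation routine that averages accumulated reward over many episodes with exploration and learning switched off. The features, step sizes and toy environment are assumptions for illustration only:

```python
import numpy as np

# Toy 1-D environment matching the question's model (constants are made up)
def env_step(y, a):
    y_next = 0.9 * y + 0.5 * a + np.random.normal(0.0, 0.05)
    return y_next, -(y_next ** 2 + 0.1 * a ** 2)

feats = lambda y: np.array([1.0, y, y ** 2])    # critic features, V(y) ~ w . feats(y)
w = np.zeros(3)                                 # critic weights
theta = np.zeros(2)                             # actor weights, Ac(y) = theta . [1, y]
alpha_v, alpha_a, gamma, sigma = 0.01, 0.01, 0.99, 0.3

def actor(y):
    return theta @ np.array([1.0, y])

def train_step(y):
    """One CACLA-style update: the critic learns V; the actor moves toward the
    executed action only when the TD error is positive."""
    global w, theta
    a = actor(y) + np.random.normal(0.0, sigma)              # Gaussian exploration
    y_next, r = env_step(y, a)
    delta = r + gamma * (w @ feats(y_next)) - w @ feats(y)   # TD error
    w += alpha_v * delta * feats(y)                          # critic update
    if delta > 0:                                            # CACLA actor update
        theta += alpha_a * (a - actor(y)) * np.array([1.0, y])
    return y_next

def evaluate(episodes=100, horizon=50):
    """Accumulated reward per episode with exploration and learning turned off,
    averaged over many episodes because the environment is stochastic."""
    returns = []
    for _ in range(episodes):
        y, total = np.random.uniform(-1, 1), 0.0
        for _ in range(horizon):
            y, r = env_step(y, actor(y))                     # greedy action, no updates
            total += r
        returns.append(total)
    return np.mean(returns), np.std(returns)

# train, resetting the state now and then, then measure performance
y = np.random.uniform(-1, 1)
for t in range(20000):
    y = train_step(y)
    if t % 50 == 0:
        y = np.random.uniform(-1, 1)
print("mean return over 100 greedy episodes:", evaluate()[0])
```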
Upvotes: 3