RajeshS

Reputation: 99

Is reinforcement learning applicable to a RANDOM environment?

I have a fundamental question on the applicability of reinforcement learning (RL) on a problem we are trying to solve.

We are trying to use RL for inventory management, where the demand is entirely random (it probably has a pattern in real life, but for now let us assume we are forced to treat it as purely random).

As I understand it, RL can help an agent learn to play a game (say, chess) or help a robot learn to walk. But all games have rules, and so does the ‘cart-pole’ of OpenAI Gym: there are rules of ‘physics’ that govern when the cart-pole will tip and fall over.

For our problem there are no such rules: the environment changes randomly (the demand made for the product).

Is RL really applicable to such situations?

If it is, then what will improve the performance?

Further details:

- The only two stimuli available from the ‘environment’ are the currently available level of product 'X' and the current demand 'Y'.
- The ‘action’ is binary: do I order a quantity 'Q' to refill, or do I not (a discrete action space)?
- We are using DQN and an Adam optimizer.
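
To make the setup concrete, here is a rough sketch of the kind of environment we are working with (the class name, capacity, costs, and demand range below are illustrative placeholders, not our actual code):

    import numpy as np

    # Rough sketch of the setup described above; all numbers are illustrative.
    class InventoryEnv:
        def __init__(self, capacity=100, order_qty=20, max_demand=30,
                     holding_cost=0.1, stockout_cost=1.0, seed=None):
            self.capacity = capacity
            self.order_qty = order_qty          # fixed refill quantity 'Q'
            self.max_demand = max_demand
            self.holding_cost = holding_cost    # cost per unit left in stock
            self.stockout_cost = stockout_cost  # cost per unit of unmet demand
            self.rng = np.random.default_rng(seed)

        def reset(self):
            self.stock = self.capacity // 2
            self.demand = int(self.rng.integers(0, self.max_demand + 1))
            return np.array([self.stock, self.demand], dtype=np.float32)

        def step(self, action):
            # action is binary: 1 = order 'Q' units, 0 = do nothing
            if action == 1:
                self.stock = min(self.stock + self.order_qty, self.capacity)
            unmet = max(self.demand - self.stock, 0)
            self.stock = max(self.stock - self.demand, 0)
            reward = -(self.holding_cost * self.stock + self.stockout_cost * unmet)
            # the next demand is drawn purely at random
            self.demand = int(self.rng.integers(0, self.max_demand + 1))
            obs = np.array([self.stock, self.demand], dtype=np.float32)
            return obs, reward, False, {}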

Our results are poor. I admit I have trained for only about 5,000 or 10,000 iterations; should I let it train for days because it is a random environment?

Thank you, Rajesh

Upvotes: 3

Views: 2777

Answers (2)

Priyanthi

Reputation: 1785

Randomness can be handled by replacing the single average value output with a distribution over the possible returns. By introducing a new learning rule, reflecting the transition from Bellman's (average) equation to its distributional counterpart, the value-distribution approach has been able to surpass the performance of all other comparable approaches.

https://www.deepmind.com/blog/going-beyond-average-for-reinforcement-learning
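
For intuition, here is a rough sketch of the categorical (C51-style) projection at the heart of that approach. This is not the blog's or paper's actual code; the parameter names and the plain-Python loop are illustrative only. Instead of regressing towards the scalar target r + gamma * max_a' Q(s', a'), the shifted return distribution is projected back onto a fixed set of atoms and the network is trained to match it:

    import numpy as np

    # Illustrative sketch of the categorical (C51-style) projection step.
    def project_distribution(rewards, dones, next_probs, atoms, gamma=0.99):
        """Project r + gamma * Z(s', a*) back onto the fixed support `atoms`.

        rewards:    (batch,) immediate rewards
        dones:      (batch,) 1.0 where the episode ended, else 0.0
        next_probs: (batch, n_atoms) target-network probabilities at (s', a*)
        atoms:      (n_atoms,) fixed support z_0 ... z_{N-1}
        """
        v_min, v_max = atoms[0], atoms[-1]
        delta_z = atoms[1] - atoms[0]
        batch, n_atoms = next_probs.shape

        # Bellman-shift every atom, then clip it back into the support range.
        tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * atoms[None, :]
        tz = np.clip(tz, v_min, v_max)

        # Split each shifted atom's probability mass between its two neighbours.
        b = (tz - v_min) / delta_z
        lower = np.floor(b).astype(int)
        upper = np.ceil(b).astype(int)
        projected = np.zeros_like(next_probs)
        for i in range(batch):
            for j in range(n_atoms):
                if lower[i, j] == upper[i, j]:      # landed exactly on an atom
                    projected[i, lower[i, j]] += next_probs[i, j]
                else:
                    projected[i, lower[i, j]] += next_probs[i, j] * (upper[i, j] - b[i, j])
                    projected[i, upper[i, j]] += next_probs[i, j] * (b[i, j] - lower[i, j])
        # `projected` is the training target; the network is fit to it with a
        # cross-entropy loss instead of the usual squared TD error.
        return projected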

Upvotes: 2

mimoralea

Reputation: 9986

You are using "random" in the sense of non-stationary, so, no, RL is not the best fit here.

Reinforcement learning assumes your environment is stationary. The underlying probability distribution of your environment (both transition and reward function) must be held constant throughout the learning process.

Sure, RL and DRL can deal with somewhat non-stationary problems, but they struggle with them. Markov Decision Processes (MDPs) and Partially Observable MDPs assume stationarity, so value-based algorithms, which are specialized in exploiting MDP-like environments, such as SARSA, Q-learning, DQN, DDQN, Dueling DQN, etc., will have a hard time learning anything in non-stationary environments. The more you move towards policy-based algorithms such as PPO or TRPO, or better yet gradient-free methods such as GA, CEM, etc., the better your chances, because these algorithms don't try to exploit that assumption. Also, tuning the learning rate would be essential to make sure the agent never stops learning.

Your best bet is to go with black-box optimization methods such as Genetic Algorithms.
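
To give a sense of what that looks like for the inventory problem, here is a rough sketch of a cross-entropy-method search over a simple "order Q whenever stock drops below a threshold" policy. The threshold parameterization, the helper names, and all numbers are illustrative assumptions, and `env` is any environment with the reset/step interface described in the question:

    import numpy as np

    # Illustrative black-box search: evaluate a candidate threshold by rollout,
    # keep the best candidates, and refit the search distribution around them.
    def evaluate_threshold(threshold, env, episodes=5, horizon=200):
        total = 0.0
        for _ in range(episodes):
            obs = env.reset()
            for _ in range(horizon):
                action = 1 if obs[0] < threshold else 0   # obs[0] = current stock
                obs, reward, done, _ = env.step(action)
                total += reward
                if done:
                    break
        return total / episodes

    def cem_search(env, iterations=30, pop_size=50, elite_frac=0.2, seed=0):
        rng = np.random.default_rng(seed)
        mu, sigma = 50.0, 20.0                   # initial guess for the threshold
        n_elite = max(1, int(pop_size * elite_frac))
        for _ in range(iterations):
            candidates = rng.normal(mu, sigma, size=pop_size)
            scores = np.array([evaluate_threshold(c, env) for c in candidates])
            elite = candidates[np.argsort(scores)[-n_elite:]]    # highest returns
            mu, sigma = elite.mean(), elite.std() + 1e-3         # refit, keep exploring
        return mu   # the learned reorder threshold

A genetic algorithm would replace the Gaussian refit with selection, crossover, and mutation, but the key point is the same: these methods only need episode returns, not Bellman-style bootstrapped targets.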

Upvotes: 4
