Reputation: 64719
I have a real-time domain where I need to assign an action to each of N actors: moving one of O objects to one of L locations. At each time step, I'm given a reward R indicating the overall success of all actors.
I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 50 × 1000 = 50,000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall, or how close it is to an actor. This results in 2,500,000 potential state-action combinations per actor.
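For reference, the combinatorics can be checked directly (a trivial sketch; the counts simply multiply the quantities given above):

```python
# Quick sanity check of the per-actor combinatorics.
objects = 50
locations = 1000
factors = 50

actions_per_actor = objects * locations   # choose an (object, location) pair
print(actions_per_actor)                  # 50000

# Crossed with the environmental factors mentioned above:
print(actions_per_actor * factors)        # 2500000
```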
Nearly all reinforcement learning algorithms seem unsuitable for this domain.
First, nearly all of them involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever to converge to a policy using something as basic as Q-learning, even with function approximation. And even if it did converge, finding the best action out of millions of candidates at every time step would take too long.
Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given might be polluted by the mistakes of one or more actors.
How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithm.
Upvotes: 3
Views: 477
Reputation: 14031
N=10 actors
O=50 objects
L=1000 locations
S=50 features
As I understand it, you have a warehouse with N actors, O objects, L locations, and some walls. The goal is to make sure that each of the O objects ends up in one of the L locations in the least amount of time. The action space consists of decisions about which actor should be moving which object to which location at any point in time. The state space consists of the S X-dimensional environmental factors, with features such as the proximity of actors and objects to walls and to each other. So, at first glance, you have X^S · (O·L)^N action values, with most action dimensions discrete.
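To get a feel for the scale of that expression, here is a rough count under the assumption (mine, for illustration) that each of the S features is discretized into X levels:

```python
# Assumed discretization: each of the S features takes one of X values.
X = 10                        # hypothetical levels per feature
S, O, L, N = 50, 50, 1000, 10

states = X ** S               # distinct environmental states: X^S
joint_actions = (O * L) ** N  # (object, location) choices across N actors

# Even a crude discretization yields an astronomically large table:
print(len(str(states * joint_actions)))  # 97 digits
```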
The problem as stated is not a good candidate for reinforcement learning. However, it is unclear what the environmental factors really are and how many of the restrictions are self-imposed. So, let's look at a related, but different problem.
Let's look at a single actor. Say it knows its own position in the warehouse, the positions of the other 9 actors, the positions of the 50 objects, and the 1000 locations. It wants to achieve the maximum reward, which happens when each of the 50 objects is at one of the 1000 locations.
Suppose we have a P-dimensional representation of a position in the warehouse. Each position could be occupied by the actor in focus, one of the other actors, an object, or a location. The action is to choose an object and a location. Therefore, we have a 4P-dimensional state space and a P^2-dimensional action space; in other words, a (4P + P^2)-dimensional value function. By further experimenting with the representation, using different-precision encodings for different parameters, and using options [2], it might be possible to bring the problem into the practical realm.
For examples of learning in complicated spatial settings, I would recommend reading the Konidaris papers [1] and [2].
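The Fourier basis from [1] is simple to sketch: for a state x scaled into [0, 1]^d, the order-n basis functions are phi_c(x) = cos(pi * c · x), one for each integer coefficient vector c in {0, ..., n}^d. A minimal version (the example state is mine):

```python
import itertools
import numpy as np

def fourier_features(x, order):
    """Order-n Fourier basis: phi_c(x) = cos(pi * c . x) for every
    integer coefficient vector c in {0, ..., order}^d.
    Assumes x is already scaled into [0, 1]^d."""
    d = len(x)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=d)))
    return np.cos(np.pi * coeffs @ x)

x = np.array([0.25, 0.5])            # example 2-D state in [0, 1]^2
phi = fourier_features(x, order=2)
print(phi.shape)                      # (9,): (order + 1) ** d functions
```

Note the feature count grows as (order + 1)^d, so in high-dimensional state spaces one typically keeps only a subset of the coefficient vectors.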
[1] Konidaris, G., Osentoski, S. & Thomas, P., 2008. Value function approximation in reinforcement learning using the Fourier basis. Computer Science Department Faculty Publication Series, p. 101.
[2] Konidaris, G. & Barto, A., 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Y. Bengio et al., eds. Advances in Neural Information Processing Systems, 18, pp. 1015-1023.
Upvotes: 4