Reputation: 944
I am trying to implement an agent that uses Q-learning to play Ludo. I've trained it with an ε-greedy action selector, with an epsilon of 0.1, a learning rate of 0.6, and a discount factor of 0.8.
I ran the game for around 50K steps and haven't won a single game. This is puzzling, as the Q-table seems to be pretty close to what I want it to be. Why am I losing so much to random players? Shouldn't the system be able to win if the Q-table isn't changing that much, and in general, how many iterations would I have to train my agent for?
I am not sure how much information is needed; I will update the post with relevant information on request.
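For context, my action selection and update rule are equivalent to this sketch (the names and data layout here are illustrative, not my actual code):

    import random

    # Hyperparameters from the question above
    EPSILON = 0.1   # exploration rate for the epsilon-greedy selector
    ALPHA = 0.6     # learning rate
    GAMMA = 0.8     # discount factor

    def choose_action(q_table, state, n_actions):
        # Epsilon-greedy selection over one row of the Q-table
        if random.random() < EPSILON:
            return random.randrange(n_actions)                         # explore
        return max(range(n_actions), key=lambda a: q_table[state][a])  # exploit

    def q_update(q_table, state, action, reward, next_state, n_actions):
        # Standard one-step Q-learning update
        best_next = max(q_table[next_state][a] for a in range(n_actions))
        td_target = reward + GAMMA * best_next
        q_table[state][action] += ALPHA * (td_target - q_table[state][action])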
Possible states, represented as rows in the Q-table:
Possible actions, represented as columns for each state:
I start by initializing my Q-table with random values, and end up with a table that looks like this after 5000 iterations:
-21.9241 345.35 169.189 462.934 308.445 842.939 256.074 712.23 283.328 137.078 -32.8
398.895 968.8 574.977 488.216 468.481 948.541 904.77 159.578 237.928 29.7712 417.599
1314.25 756.426 333.321 589.25 616.682 583.632 481.84 457.585 683.22 329.132 227.329
1127.58 1457.92 1365.58 1429.26 1482.69 1574.66 1434.77 1195.64 1231.01 1232.07 1068
807.592 1070.17 544.13 1385.63 883.123 1662.97 524.08 966.205 1649.67 509.825 909.006
225.453 1141.34 536.544 242.647 1522.26 1484.47 297.704 993.186 589.984 689.73 1340.89
1295.03 310.461 361.776 399.866 663.152 334.657 497.956 229.94 294.462 311.505 1428.26
My immediate reward is based on how far each token has progressed in the game, multiplied by a constant of 10, after an action has been performed. The home position has value -1, the goal position has value 99, and all positions in between have values from 0 to 55. If a token is in the goal, an extra reward of +100 is added to the immediate reward for each token in the goal.
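In code, the reward works out to something like this (a sketch with illustrative names, not my exact implementation):

    def immediate_reward(token_positions):
        # Position scheme from above: -1 = home, 0-55 = board, 99 = goal.
        # 10 * progress of every token, plus +100 for every token in the goal.
        # (Illustrative sketch -- my real code computes positions from the game state.)
        GOAL = 99
        reward = 0
        for pos in token_positions:
            reward += 10 * pos
            if pos == GOAL:
                reward += 100
        return reward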
Usually, my player always moves one token to the goal... and that's it.
Upvotes: 0
Views: 332
Reputation: 798
Why am I losing so much to random players? Shouldn't the system be able to win if the Q-table isn't changing that much?
It could be a bug in your Q-learning implementation. You say the values in the learned Q-table are close to what you expect, though. If the values are converging, then I think it's less likely to be a bug and more likely that...
Your agent is doing the best it can given the state representation.
Q-table entries converge to the optimal value of taking an action in a given state. For this "optimal policy" to translate into what we would call good Ludo play, the states the agent learns on need to correspond directly to the states of the board game. Looking at your states, multiple arrangements of pieces on the board map to the same state. For instance, if players have multiple tokens, the state space does not represent the positions of all of them (and neither does the action space). This could be why you are observing that the agent only moves one token and then stops: it can't see that it has any other actions to take, because it believes it's done!
As another example of why this is a problem, the agent may want to take different actions depending on the positions of the opponent's pieces, so to play optimally it needs that information too. It has to be included in your state representation.
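To make the contrast concrete, a table that could in principle learn an optimal policy would have to be indexed by something like this (a hypothetical encoding, not code from your project):

    def full_state(my_tokens, opponent_tokens):
        # My own tokens stay ordered, because actions refer to specific tokens;
        # opponents' tokens are interchangeable from my point of view, so sorting
        # them merges equivalent arrangements. (Hypothetical encoding.)
        return tuple(my_tokens) + tuple(sorted(opponent_tokens))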
You could start adding rows to the Q-table, but here's the problem you'll run into: there are too many possible states in Ludo to feasibly learn tabularly (using a Q-table). The size would be something like all of your current states, multiplied by every possible position of every other token on the board.
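A back-of-the-envelope count shows the scale (the numbers here are assumptions, not taken from your setup):

    # Rough size of a fully descriptive Q-table
    # (assumed: 4 players x 4 tokens, ~58 distinguishable positions per token,
    # ignoring illegal overlaps -- the exact figures don't change the conclusion).
    positions_per_token = 58
    total_tokens = 4 * 4
    print(positions_per_token ** total_tokens)   # on the order of 10^28 rows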
So to answer this question:
in general how many iterations would I have to train my agent?
With a state space that accurately represents every arrangement of the board: too many iterations to be feasible. You will need to look into defining features of states to learn on. Features highlight the important differences between states and discard the rest, so you can think of this as compressing the state space the agent learns on. You may then also want to use a function approximator instead of a Q-table to cope with what will likely still be a very large number of feature combinations, as in the sketch below. You can read more about this in Reinforcement Learning: An Introduction, particularly around Section 3.9.
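As a rough illustration of the function-approximation direction (the feature choices and the state_action_summary helper are hypothetical stand-ins for your own game logic, not a tested implementation):

    import numpy as np

    ALPHA = 0.01   # a smaller learning rate is typical with function approximation
    GAMMA = 0.8

    def features(state, action):
        # Hand-crafted features of a (state, action) pair. The three quantities
        # below are placeholders for whatever your game logic can compute, e.g.
        # how far the moved token gets, whether it knocks out an opponent, and
        # whether it leaves a token exposed. state_action_summary is hypothetical.
        progress, knocks_out, exposed = state_action_summary(state, action)
        return np.array([progress / 55.0, float(knocks_out), float(exposed), 1.0])

    weights = np.zeros(4)

    def q_value(state, action):
        # Linear approximation: Q(s, a) = w . phi(s, a)
        return weights @ features(state, action)

    def q_update(state, action, reward, next_state, legal_actions):
        # Semi-gradient Q-learning: adjust the weights instead of table entries
        global weights
        best_next = max(q_value(next_state, a) for a in legal_actions)
        td_error = reward + GAMMA * best_next - q_value(state, action)
        weights = weights + ALPHA * td_error * features(state, action)

The point is that the weight vector has a handful of entries instead of one per board arrangement, so the agent generalizes across states it has never visited.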
Upvotes: 1