Reputation: 1
I would like to solve the Gambler's problem as an MDP (Markov Decision Process).
Gambler's problem: A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of κ dollars, or loses by running out of money. On each flip, the gambler must decide how many (integer) dollars to stake. The probability of heads is p and that of tails is 1 − p.
I implemented the model-free Q-learning method using a totally random base policy. But the code is not working as I hoped and I can't figure out why. Thank you for any suggestions. :)
import numpy as np
import matplotlib.pyplot as plt
import random

# data
kappa = 100   # goal
p = 0.25      # probability of heads (winning)
eps = 0.1     # 0.1, 0.005 epsilon
gamma = 0.9   # discount factor
alpha = 0.1   # 0.1, 1, 10 learning rate
n = 1000      # number of training episodes

# Q-learning with a totally random base policy
S = [*range(0, kappa + 1)]
A = [*range(0, kappa + 1)]
R = np.zeros((kappa + 1, kappa + 1))
for i in A:
    R[kappa, i] = 1
Q = np.zeros((kappa + 1, kappa + 1))
optimal_policy = np.zeros(kappa + 1)

for sa in range(1, kappa):
    i = 0
    while i < n:
        s = sa
        while True:
            # choose a random action; I can stake at most the coins I own,
            # and no more than needed to reach the goal
            seged = min(s, kappa - s)
            a = np.random.randint(low=1, high=seged + 1)
            # take the action, observe the next state
            rand = random.uniform(0, 1)
            if rand < p:    # if I win, I get more coins
                s_next = s + a
            else:           # if I lose, I lose the stake
                s_next = s - a
            Q[s, a] = Q[s, a] + alpha * (R[s_next, a] + gamma * max(Q[s_next, b] for b in range(0, s_next + 1)) - Q[s, a])
            if s_next == 0:
                break
            if s_next == kappa:
                i = i + 1
                break
            s = s_next

for s in range(1, kappa + 1):
    optimal_policy[s] = np.argmax(Q[s, :])

Q = np.round(Q, 2)
print(Q)
print(optimal_policy)

x = np.array(range(0, kappa + 1))
y = optimal_policy
plt.xlabel("Amount available (Current State)")
plt.ylabel("Recommended betting amount")
plt.title("Optimal policy: Random base policy (p=" + str(p) + ", \u03B1=" + str(alpha) + ")")
plt.scatter(x, y)
plt.show()
Upvotes: 0
Views: 409
Reputation: 5467
The problem seems to be that your while i < n loop never terminates. It looks like you accidentally wait until the first win before incrementing i. (You forgot to increment i when the episode ends with a loss.) To avoid this mistake, I suggest writing that loop as for i in range(n) instead of incrementing i before each break.
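With the variables from your code, that restructuring could look roughly like this (a minimal sketch, keeping your Q-update as-is):

# Sketch of the suggested fix: count every episode, win or lose.
# Uses kappa, n, p, alpha, gamma, Q, R, np, random from the code above.
for sa in range(1, kappa):
    for i in range(n):                   # exactly n episodes per start state
        s = sa
        while True:
            seged = min(s, kappa - s)    # largest allowed stake
            a = np.random.randint(low=1, high=seged + 1)
            s_next = s + a if random.uniform(0, 1) < p else s - a
            Q[s, a] += alpha * (R[s_next, a]
                                + gamma * max(Q[s_next, b] for b in range(s_next + 1))
                                - Q[s, a])
            if s_next == 0 or s_next == kappa:  # terminal either way
                break
            s = s_next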
This first win never happens, because when starting with 1 dollar and a win probability of 25%, it is (in practice) impossible to win this game. This also means that your first few iterations (starting with little money) will not learn anything, because they never win: the reward R[] is always zero, and there is no signal in the Q[] table yet to propagate between states.
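You can convince yourself of this with a quick Monte Carlo check (a standalone sketch; the function name and the trial count of 100,000 are my own arbitrary choices):

# Sanity check: how often does a random-stake episode starting
# from 1 dollar ever reach kappa = 100 when p = 0.25?
import random

def win_rate(start, kappa=100, p=0.25, trials=100_000):
    wins = 0
    for _ in range(trials):
        s = start
        while 0 < s < kappa:
            a = random.randint(1, min(s, kappa - s))  # random legal stake
            s = s + a if random.random() < p else s - a
        wins += (s == kappa)
    return wins / trials

print(win_rate(1))  # almost always prints 0.0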
What I did to figure this out was simply to insert some statements like print('i:', i) into the code.
Upvotes: 1