Reputation: 61
I have been trying to implement exercise 2.5 from the Reinforcement Learning book.
I have written this piece of code according to this pseudocode version:
import numpy as np

class k_arm:
    def __init__(self, iter, method="incrementally"):
        # self.iter placeholder
        self.iter = iter
        self.k = 10
        self.eps = .1
        # here is Q(a) and N(a)
        self.qStar = np.zeros(self.k)
        self.n = np.zeros(self.k)
        # Method just for experimenting different functions
        self.method = method

    def pull(self):
        # selecting argmax(Q(A)) action with prob. (1 - eps)
        eps = np.random.uniform(0, 1, 1)
        if eps < self.eps or self.qStar.argmax() == 0:
            a = np.random.randint(10)
        else:
            a = self.qStar.argmax()
        # R <- bandit(A)
        r = np.random.normal(0, 0.01, 1)
        # N(A) <- N(A) + 1
        self.n[a] += 1
        # Q(A) <- Q(A) + 1 / N(A) * (R - Q(A))
        if self.method == "incrementally":
            self.qStar[a] += (r - self.qStar[a]) / self.n[a]
        return self.qStar[a]

iter = 1000
rewards = np.zeros(iter)
c = k_arm(iter, method="incrementally")
for i in range(iter):
    k = c.pull()
    rewards[i] = k
And I get this as a result,
whereas I am expecting this kind of result.
I have been trying to understand where I went wrong, but I couldn't figure it out.
Upvotes: 4
Views: 760
Reputation: 354
Your average reward is around 0 because it is the correct estimation. Your reward function is defined as:
# R bandit(A)
r = np.random.normal(0, 0.01, 1)
This means the expected value of your reward distribution is 0, with a standard deviation of 0.01 (the second argument of np.random.normal is the standard deviation, not the variance). In the book the authors use a different reward function. While this still has a fundamental issue, you could earn similar rewards if you change your code to
# R bandit(A)
r = np.random.normal(1.25, 0.01, 1)
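As a quick sanity check of the point about the original reward distribution, here is a minimal sketch that samples from np.random.normal(0, 0.01) many times and prints the empirical mean and spread. The seed and the sample count are arbitrary choices for illustration, not something from the question.

```python
import numpy as np

# Seeded generator so the check is repeatable (the seed is an arbitrary choice).
rng = np.random.default_rng(0)

# Same distribution as the question's reward: loc=0, scale=0.01.
samples = rng.normal(0, 0.01, 100_000)

print("mean:", samples.mean())      # close to 0
print("std: ", samples.std())       # close to 0.01 (scale is a standard deviation)
print("var: ", samples.var())       # close to 0.0001
```

Whatever arm is pulled, the reward comes from this one distribution, so every action-value estimate converges towards 0.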
It makes sense to give each bandit a different reward function, or all your action values will be the same. So what you really should do is sample from k different distributions with different expected values. Otherwise action selection is meaningless.
Add this to your __init__ function:
self.expected_vals = np.random.uniform(0, 2, self.k)
and change the calculation of the reward so that it depends on the action:
r = np.random.normal(self.expected_vals[a], 0.5, 1)
I've also increased the standard deviation to 0.5, as 0.01 is basically meaningless noise in the context of bandits. If your agent works correctly, its average reward should be approximately equal to np.max(self.expected_vals) (slightly lower, because about 10% of the pulls are exploratory).
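Putting the suggestions together, a minimal self-contained sketch could look like the following. It is not the book's exact testbed: the class name KArmedBandit, the seed, and the number of pulls are hypothetical choices for illustration, and pull() returns the sampled reward r rather than the estimate Q(a), on the assumption that the average reward is what you want to plot.

```python
import numpy as np

class KArmedBandit:
    """Epsilon-greedy agent on a k-armed testbed with a different mean per arm."""

    def __init__(self, k=10, eps=0.1, seed=0):
        self.rng = np.random.default_rng(seed)  # seed is an arbitrary choice
        self.k = k
        self.eps = eps
        # True expected reward of each arm (unknown to the agent).
        self.expected_vals = self.rng.uniform(0, 2, k)
        self.q = np.zeros(k)  # Q(a): estimated value of each arm
        self.n = np.zeros(k)  # N(a): number of times each arm was pulled

    def pull(self):
        # Explore with probability eps, otherwise act greedily on Q.
        if self.rng.random() < self.eps:
            a = int(self.rng.integers(self.k))
        else:
            a = int(np.argmax(self.q))
        # Reward depends on the chosen arm: mean expected_vals[a], std 0.5.
        r = self.rng.normal(self.expected_vals[a], 0.5)
        # Incremental sample-average update: Q(A) <- Q(A) + (R - Q(A)) / N(A).
        self.n[a] += 1
        self.q[a] += (r - self.q[a]) / self.n[a]
        return r

bandit = KArmedBandit()
rewards = np.array([bandit.pull() for _ in range(5000)])

print("best arm mean:", bandit.expected_vals.max())
print("average reward over the last 1000 pulls:", rewards[-1000:].mean())
```

Because the arms now have different means, the greedy choice actually matters, and the printed average should sit well above the mean over all arms and a little below the best arm's mean.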
Upvotes: 2