merkwur

Reputation: 61

K-Arms Bandit Epsilon-Greedy Policy

I have been trying to implement exercise 2.5 from the Reinforcement Learning book.

I have written this piece of code according to this pseudocode version:

[image: pseudocode of the simple bandit algorithm from the book]

import numpy as np

class k_arm:
    def __init__(self, iter, method="incrementally"):

        # number of iterations to run
        self.iter = iter
        self.k = 10
        self.eps = .1
        
        # here is Q(a) and N(a)
        self.qStar = np.zeros(self.k)
        self.n = np.zeros(self.k)
        
        # Method just for experimenting different functions
        self.method = method
        
    def pull(self):
        
        # selecting argmax(Q(A)) action with prob. (1 - eps)
        eps = np.random.uniform(0, 1, 1)
        if eps < self.eps or self.qStar.argmax() == 0:
            a = np.random.randint(10)
        else: a = self.qStar.argmax()
        
        # R bandit(A)
        r = np.random.normal(0, 0.01, 1)
        
        # N(A) <- N(A) + 1
        self.n[a] += 1
        
        # Q(A) <- Q(A) + (1 / N(A)) * (R - Q(A))
        if self.method == "incrementally":
            self.qStar[a] += (r - self.qStar[a]) / self.n[a]
            return self.qStar[a]

iter = 1000
rewards = np.zeros(iter)
c = k_arm(iter, method="incrementally")

for i in range(iter):    
    k = c.pull()
    rewards[i] = k

And I get this as a result

[image: resulting reward plot, hovering around 0]

Whereas I am expecting this kind of result:

[image: expected reward plot from the book]

I have been trying to understand where I went wrong here, but I can't figure it out.

Upvotes: 4

Views: 760

Answers (1)

tnfru

Reputation: 354

Your average reward is around 0 because that is the correct estimate. Your reward function is defined as:

 # R bandit(A)
 r = np.random.normal(0, 0.01, 1)

This means the expected value of your reward distribution is 0, with a standard deviation of 0.01 (numpy's second argument is the scale, not the variance). In the book the authors use a different reward function. While the following still leaves a more fundamental issue (the reward does not depend on the chosen action), you would at least get rewards of a similar magnitude if you change your code to

 # R bandit(A)
 r = np.random.normal(1.25, 0.01, 1)
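For reference, the reward function in the book's 10-armed testbed draws each arm's true value q*(a) from a standard normal distribution and then samples each reward around that value with unit variance. A minimal sketch of that setup (the names q_true and bandit are mine, not from your code):

import numpy as np

k = 10
# True action values q*(a), one per arm, as in the book's 10-armed testbed
q_true = np.random.normal(0, 1, k)

def bandit(a):
    # Reward for pulling arm a: centred on q*(a) with unit standard deviation
    return np.random.normal(q_true[a], 1)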

It makes sense to give each arm its own reward distribution, or all your action values will end up the same. So what you really should do is sample from k different distributions with different expected values; otherwise action selection is meaningless. Add this to your __init__ function:

self.expected_vals = np.random.uniform(0, 2, self.k)

and change the calculation of the reward so that it depends on the action:

r = np.random.normal(self.expected_vals[a], 0.5, 1)

I've also increased the standard deviation to 0.5, as 0.01 is basically meaningless noise in the context of bandits. If your agent works correctly, its average reward should be approximately equal to np.max(self.expected_vals).
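Putting these pieces together, a rough sketch of how your class could look with per-arm expected values and an action-dependent reward (I sample the reward from np.random.normal and return the sampled reward rather than the estimate, so the running average matches the behaviour described above; treat it as a sketch, not a drop-in replacement):

import numpy as np

class k_arm:
    def __init__(self, k=10, eps=0.1):
        self.k = k
        self.eps = eps
        self.qStar = np.zeros(k)                         # Q(a): estimated action values
        self.n = np.zeros(k)                             # N(a): pull counts
        self.expected_vals = np.random.uniform(0, 2, k)  # per-arm expected rewards

    def pull(self):
        # Epsilon-greedy action selection
        if np.random.uniform() < self.eps:
            a = np.random.randint(self.k)
        else:
            a = self.qStar.argmax()

        # Reward now depends on the chosen arm
        r = np.random.normal(self.expected_vals[a], 0.5)

        # Incremental update: Q(A) <- Q(A) + (R - Q(A)) / N(A)
        self.n[a] += 1
        self.qStar[a] += (r - self.qStar[a]) / self.n[a]
        return r

c = k_arm()
rewards = np.array([c.pull() for _ in range(10000)])
# The tail average should settle near np.max(c.expected_vals)
print(rewards[-1000:].mean(), c.expected_vals.max())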

Upvotes: 2
