Reputation: 2355
I have started learning reinforcement learning and am following the book by Sutton and Barto. I was trying to understand the non-stationary environment, which the book describes as:
suppose the bandit task were nonstationary, that is, that the true values of the actions changed over time. In this case exploration is needed even in the deterministic case to make sure one of the nongreedy actions has not changed to become better than the greedy one.
This tells me that the true expected reward of each action changes with time. But does this mean it changes at every time step? I understand how to track the rewards in that case, i.e. by weighting recent rewards more heavily than older ones at every time step. However, does non-stationarity also imply that the targets, i.e. the true values, change at every time step? I am trying to simulate the 10-armed bandit problem from the same figure in the book, where Upper-Confidence-Bound action selection and the epsilon-greedy method are compared using sample averages to estimate action values in a stationary environment.
If I have to simulate the same setup with a non-stationary environment, how can I do that? Below is my code:
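To make that concrete, by "weighting the recent rewards more" I mean the constant step-size update, which is an exponential recency-weighted average (a minimal sketch; the helper name update_estimate is mine, and alpha = 0.1 is the value I use in the code below):

    # Exponential recency-weighted average: Q_{n+1} = Q_n + alpha * (R_n - Q_n).
    # Unrolling the recursion gives reward R_i a weight of alpha * (1 - alpha)**(n - i),
    # so recent rewards count more than older ones.
    def update_estimate(Q, R, alpha=0.1):
        return Q + alpha * (R - Q)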
import numpy as np
from tqdm import trange


class NArmedBandit:
    # 10-armed bandit testbed with sample averages
    def __init__(self, k=10, step_size=0.1, eps=0, UCB_c=None, sample_avg_flag=False,
                 init_estimates=0.0, mu=0, std_dev=1):
        self.k = k
        self.step_size = step_size
        self.eps = eps
        self.init_estimates = init_estimates
        self.mu = mu
        self.std_dev = std_dev
        self.actions = np.zeros(k)
        self.true_reward = 0.0
        self.UCB_c = UCB_c
        self.sample_avg_flag = sample_avg_flag
        self.re_init()

    def re_init(self):
        # true reward values for each action
        self.actions = np.random.normal(self.mu, self.std_dev, self.k)
        # estimate for each action
        self.Q_t = np.zeros(self.k) + self.init_estimates
        # number of times each action has been chosen
        self.N_t = np.zeros(self.k)
        # best action to choose
        self.optim_action = np.argmax(self.actions)
        self.time_step = 0

    def act(self):
        val = np.random.rand()
        if val < self.eps:
            # explore: pick a random action
            action = np.random.choice(np.arange(self.k))
        elif self.UCB_c is not None:
            # 1e-5 is added to avoid division by zero
            ucb_estimates = self.Q_t + self.UCB_c * np.sqrt(
                np.log(self.time_step + 1) / (self.N_t + 1e-5))
            A_t = np.max(ucb_estimates)
            action = np.random.choice(np.where(ucb_estimates == A_t)[0])
        else:
            # exploit: pick greedily, breaking ties at random
            A_t = np.max(self.Q_t)
            action = np.random.choice(np.where(self.Q_t == A_t)[0])
        return action

    def step(self, action):
        # generate the reward from N(true value, 1)
        reward = np.random.randn() + self.actions[action]
        self.time_step += 1
        self.N_t[action] += 1
        if self.sample_avg_flag:
            # estimation with sample averages
            self.Q_t[action] += (reward - self.Q_t[action]) / self.N_t[action]
        else:
            # non-stationary tracking with constant step size
            self.Q_t[action] += self.step_size * (reward - self.Q_t[action])
        return reward

    def play(self, tasks, num_time_steps):
        rewards = np.zeros((tasks, num_time_steps))
        optim_action_counts = np.zeros(rewards.shape)
        for task in trange(tasks):
            self.re_init()
            for t in range(num_time_steps):
                action = self.act()
                reward = self.step(action)
                rewards[task, t] = reward
                if action == self.optim_action:
                    optim_action_counts[task, t] = 1
        avg_optim_action_counts = optim_action_counts.mean(axis=0)
        avg_rewards = rewards.mean(axis=0)
        return avg_optim_action_counts, avg_rewards
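For reference, this is roughly how I run it to compare the two methods (a minimal usage sketch; the 2000 runs of 1000 steps and the eps = 0.1 / c = 2 values are just my assumed settings for the experiment):

    import matplotlib.pyplot as plt

    # assumed settings: 2000 independent runs of 1000 steps each
    eps_bandit = NArmedBandit(k=10, eps=0.1, sample_avg_flag=True)
    ucb_bandit = NArmedBandit(k=10, UCB_c=2, sample_avg_flag=True)

    _, eps_rewards = eps_bandit.play(tasks=2000, num_time_steps=1000)
    _, ucb_rewards = ucb_bandit.play(tasks=2000, num_time_steps=1000)

    plt.plot(eps_rewards, label='eps-greedy (eps=0.1)')
    plt.plot(ucb_rewards, label='UCB (c=2)')
    plt.xlabel('Steps')
    plt.ylabel('Average reward')
    plt.legend()
    plt.show()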
Should I change the actions array (which holds the true action values) defined in re_init() by calling re_init() after every time step in play()? That would amount to changing the true expected reward of every action at each time step. I have already incorporated the code for tracking rewards in a non-stationary environment in the act() and step() functions, using a constant step size alpha = 0.1. The only thing I don't know is how to set up or simulate the non-stationary environment here, and whether I have understood it correctly.
Upvotes: 2
Views: 414
Reputation: 14701
You understand non-stationarity correctly: "the true values of the actions changed over time."
But how do they change?
That is actually not clearly defined. Your re_init approach is correct from my perspective; what you need to decide is when the true values change. One thing is clear, though: if you redraw them at every time step, there is nothing left to learn, since you are changing all the to-be-learned rewards at every step. I can offer two solutions that satisfy the non-stationarity definition (a short code sketch follows after them):
1. Call re_init with a small probability (eps) at every step, so the true values are redrawn only about once every 100 or 1000 steps.
2. Start from the initial true values and add small random +/- increments to them at every step. The true rewards will then drift in a positive or negative direction.
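Here is a minimal sketch of both options on top of your NArmedBandit class (the function name play_nonstationary and the values reinit_prob = 0.001 and drift_std = 0.01 are arbitrary choices of mine; in practice you would keep only one of the two mechanisms):

    def play_nonstationary(bandit, tasks, num_time_steps,
                           reinit_prob=0.001, drift_std=0.01):
        # Variant of play() where the true action values change over time.
        rewards = np.zeros((tasks, num_time_steps))
        optim_action_counts = np.zeros(rewards.shape)
        for task in trange(tasks):
            bandit.re_init()
            for t in range(num_time_steps):
                # Option 1: with a small probability, redraw all true values.
                if np.random.rand() < reinit_prob:
                    bandit.actions = np.random.normal(bandit.mu, bandit.std_dev, bandit.k)
                # Option 2: let every true value take a small random walk.
                bandit.actions += np.random.normal(0.0, drift_std, bandit.k)
                # The optimal action can change over time, so recompute it.
                bandit.optim_action = np.argmax(bandit.actions)
                action = bandit.act()
                reward = bandit.step(action)
                rewards[task, t] = reward
                if action == bandit.optim_action:
                    optim_action_counts[task, t] = 1
        return optim_action_counts.mean(axis=0), rewards.mean(axis=0)

With either option, the constant step-size estimate (sample_avg_flag=False) should track the drifting values better than the plain sample average.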
Upvotes: 2