Reputation: 55
I want to efficiently compute the discounted return for the cumulative rewards in reinforcement learning. In the code below I get either 0 as the result or an empty list. Could someone help me figure out what I am missing?
import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float) -> np.ndarray:
    '''
    Computes all returns for the given sequence of rewards.
    :param rewards: The sequence of rewards as a np.ndarray.
    :param gamma: The discount factor as a `float`.
    :returns: The discounted return for each time step; as a `np.ndarray`.
    '''
    T = len(rewards)
    returns = np.zeros_like(T, dtype=float)
    # implement the efficient return computation
    # YOUR CODE HERE
    R = 0
    for t in reversed(range(1, T-1)):
        # update the total discounted reward
        R = R * gamma + rewards[t]
        returns[t] = R
    return returns
The error is thrown by this test code:
print(discounted_return(np.array([0, 1]), 0.9))
assert np.isclose(np.array([1.0, 0]), discounted_return(np.array([0, 1]), 0.9)).all()# , "`discounted_return` is not implemented or wrong."
Upvotes: 0
Views: 554
Reputation: 1332
The error you are getting means that your implementation of discounted_return is wrong, and is thrown by what I assume is some code written by your teacher to test that your implementation is correct.

If you look at what returns looks like when you create it, you will see that it will always be array(0.). Indeed, np.zeros_like takes an array and returns an array of the same shape. You're giving it T, which is a scalar, so the output will be an array of dimension 0 (a scalar).
What you want to do is, for instance:

returns = np.zeros_like(rewards, dtype=float)
Also, in the case of T=2, you will see that reversed(range(1, T-1)) is empty, since list(range(1, 1)) is [], so your loop iterates 0 times. To visit every time step from the last down to the first, you want reversed(range(T)) instead.
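Putting both fixes together (allocating from rewards rather than T, and looping over all time steps), a minimal sketch of the corrected function might look like this:

```python
import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float) -> np.ndarray:
    T = len(rewards)
    # zeros of the same length as rewards, not np.zeros_like(T),
    # which would give a 0-dimensional array
    returns = np.zeros(T, dtype=float)
    R = 0.0
    # walk backwards over ALL indices T-1, T-2, ..., 0
    for t in reversed(range(T)):
        R = R * gamma + rewards[t]
        returns[t] = R
    return returns

print(discounted_return(np.array([0, 1]), 0.9))  # [0.9 1. ]
```

Each entry returns[t] is then the discounted sum of rewards from step t onward, computed in a single backward pass.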
Upvotes: 1