Gunners

Reputation: 55

Efficient cumulative reward (return) in RL in Python

I want to compute the return (the cumulative discounted reward) efficiently in reinforcement learning. With the code below I get either 0 as the result or an empty array. Could someone help me figure out what I am missing?

import numpy as np
def discounted_return(rewards: np.ndarray, gamma: float) -> np.ndarray:
    '''
    Computes all returns for the given sequence of rewards.
    :param rewards: The sequence of rewards as a np.ndarray.
    :param gamma: The discount factor as a `float`.
    :returns: The discounted return for each time step; as a `np.ndarray`.
    ''' 
    T = len(rewards)
    
    returns = np.zeros_like(T, dtype=float)
    
    # implement the efficient return computation
    # YOUR CODE HERE
    
    R = 0
    for t in reversed(range(1, T-1)):
        # update the total discounted reward
        R = R * gamma + rewards[t]
        returns[t] = R
    
    return returns

The test code that throws the error:

print(discounted_return(np.array([0, 1]), 0.9))
assert np.isclose(np.array([1.0, 0]), discounted_return(np.array([0, 1]), 0.9)).all()# , "`discounted_return` is not implemented or wrong."

Upvotes: 0

Views: 554

Answers (1)

Florent Monin

Reputation: 1332

The error you are getting means that your implementation of discounted_return is wrong; it is raised by what I assume is test code written by your teacher to check that your implementation is correct.

If you look at what returns holds right after you create it, you will see that it is always array(0.). Indeed, np.zeros_like takes an array and returns an array of the same shape. You are passing it T, which is a scalar, so the output is a 0-dimensional array (a scalar).
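To see this concretely, here is a small check you can run (the variable names are just for illustration):

import numpy as np

rewards = np.array([0, 1])
T = len(rewards)

print(np.zeros_like(T, dtype=float).shape)        # () -> 0-dimensional, a single scalar
print(np.zeros_like(rewards, dtype=float).shape)  # (2,) -> one slot per time step

With the 0-dimensional version, your loop body would also fail with an IndexError if it ever ran, since a scalar array cannot be indexed with returns[t].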

What you want to do is

returns = np.zeros_like(rewards, dtype=float)

for instance. Also, in the case of T = 2, you will see that reversed(range(1, T-1)) is empty, since list(range(1, 1)) is [], so your loop iterates zero times and returns is never filled in.
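Putting both fixes together, here is a minimal corrected sketch. One caveat: your assertion expects discounted_return(np.array([0, 1]), 0.9) to equal [1.0, 0.0], which means the return at step t is the discounted sum of the rewards strictly after t, so returns[t] has to be written before rewards[t] is folded into the running total. That convention is inferred from the test you posted, not from your docstring, so double-check it against your course material.

import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float) -> np.ndarray:
    T = len(rewards)
    returns = np.zeros_like(rewards, dtype=float)  # one slot per time step
    R = 0.0
    # Walk backwards so each return reuses the one after it:
    # returns[t] = rewards[t+1] + gamma * returns[t+1]
    for t in reversed(range(T)):
        returns[t] = R             # rewards strictly after t, discounted
        R = R * gamma + rewards[t]
    return returns

print(discounted_return(np.array([0, 1]), 0.9))  # [1. 0.]

If your course instead defines the return as G_t = r_t + gamma * G_{t+1} (including the reward at t), swap the two lines in the loop body; the assertion you posted, however, only passes with the order above.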

Upvotes: 1
