Gunners

Reputation: 55

Efficient cumulative reward (return) in RL in Python

I want to compute the return (the cumulative discounted reward) efficiently in reinforcement learning. With the code below I get either 0 as the result or an empty array. Could someone help me figure out what I am missing?

import numpy as np
def discounted_return(rewards: np.ndarray, gamma: float) -> np.ndarray:
    '''
    Computes all returns for the given sequence of rewards.
    :param rewards: The sequence of rewards as a np.ndarray.
    :param gamma: The discount factor as a `float`.
    :returns: The discounted return for each time step; as a `np.ndarray`.
    ''' 
    T = len(rewards)
    
    returns = np.zeros_like(T, dtype=float)
    
    # implement the efficient return computation
    # YOUR CODE HERE
    
    R = 0
    for t in reversed(range(1, T-1)):
        # update the total discounted reward
        R = R * gamma + rewards[t]
        returns[t] = R
    
    return returns

The test code that throws the error:

print(discounted_return(np.array([0, 1]), 0.9))
assert np.isclose(np.array([1.0, 0]), discounted_return(np.array([0, 1]), 0.9)).all()# , "`discounted_return` is not implemented or wrong."

Upvotes: 0

Views: 554

Answers (1)

Florent Monin

Reputation: 1332

The error you are getting means that your implementation of discounted_return is wrong; it is raised by what I assume is test code written by your teacher to check that your implementation is correct.

If you look at what returns holds right after you create it, you will see that it is always array(0.). Indeed, np.zeros_like takes an array and returns an array of the same shape. You are passing it T, which is a scalar, so the output is a 0-dimensional array (a scalar).
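To see this concretely, here is a small check you can run (the variable names are just for illustration):

import numpy as np

rewards = np.array([0, 1])
T = len(rewards)

print(np.zeros_like(T, dtype=float).shape)        # () -> 0-dimensional, a single scalar
print(np.zeros_like(rewards, dtype=float).shape)  # (2,) -> one slot per time step

With the 0-dimensional version, your loop body would also fail with an IndexError if it ever ran, since a scalar array cannot be indexed with returns[t].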

What you want to do is

returns = np.zeros_like(rewards, dtype=float)

for instance. Also, in the case of T = 2, you will see that reversed(range(1, T-1)) is empty, since list(range(1, 1)) is [], so your loop iterates zero times and returns is never filled in.
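Putting both fixes together, here is a minimal corrected sketch. One caveat: your assertion expects discounted_return(np.array([0, 1]), 0.9) to equal [1.0, 0.0], which means the return at step t is the discounted sum of the rewards strictly after t, so returns[t] has to be written before rewards[t] is folded into the running total. That convention is inferred from the test you posted, not from your docstring, so double-check it against your course material.

import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float) -> np.ndarray:
    T = len(rewards)
    returns = np.zeros_like(rewards, dtype=float)  # one slot per time step
    R = 0.0
    # Walk backwards so each return reuses the one after it:
    # returns[t] = rewards[t+1] + gamma * returns[t+1]
    for t in reversed(range(T)):
        returns[t] = R             # rewards strictly after t, discounted
        R = R * gamma + rewards[t]
    return returns

print(discounted_return(np.array([0, 1]), 0.9))  # [1. 0.]

If your course instead defines the return as G_t = r_t + gamma * G_{t+1} (including the reward at t), swap the two lines in the loop body; the assertion you posted, however, only passes with the order above.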

Upvotes: 1
