a_guest

Reputation: 36309

Continuous DDPG doesn't seem to converge on a two-dimensional spatial search problem ("Hunt the Thimble")

I attempted to use continuous action-space DDPG to solve the following control problem. The goal is to walk towards an initially unknown position within a bordered, two-dimensional area, being told at each step how far one is from the target position (similar to the children's game "Hunt the Thimble", where the player is guided by "temperature" levels, hot and cold).

In the setup the target position is fixed while the agent's starting position is varied from episode to episode. The goal is to learn a policy for walking as quickly as possible towards the target position. The agent's observation consists of just its current position. Concerning the reward design, I took the Reacher environment as a reference, since it involves a similar goal, and similarly use a control reward and a distance reward (see code below). That is, getting closer to the target yields a greater reward, and the closer the agent gets, the more it should favor smaller actions.
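
In isolation, the per-step reward boils down to the following (the same computation as in the step() method of the full code below; the helper name step_reward is just for illustration):

import numpy as np

def step_reward(action, pos, target, action_limit):
    # Control penalty: squared action magnitude, normalized by the action limit.
    reward_ctrl = -np.square(action).sum() / action_limit**2
    # Distance penalty: Euclidean distance from the current position to the target.
    reward_dist = -np.linalg.norm(pos - target)
    return reward_ctrl + reward_dist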

For the implementation I used the openai/spinningup package. Concerning the network architecture I figured that, if the target position were known, the optimal action would be action = target - position, i.e. the policy pi(x) -> a could be modeled as a single dense layer, with the target position learned in the form of the bias term: a = W @ x + b, where after convergence (ideally) W = -np.eye(2) and b = target. Since the environment imposes an action limit, so that the target position likely cannot be reached in a single step, I manually rescale the computed actions as a = a / tf.norm(a) * action_limit. This preserves the direction towards the target and hence still resembles the optimal action. I used this custom architecture for the policy network as well as a standard MLP architecture with 3 hidden layers (see code and results below).
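
To make the intended solution concrete, here is a small numpy sketch of the "ideal" linear policy that training would hopefully recover (W and b set by hand; the rescaling mirrors what I do in the actor network):

import numpy as np

target = np.array([0.7, 0.8])
action_limit = 0.01

# Ideal parameters of the single dense layer a = W @ x + b.
W = -np.eye(2)     # maps the current position to its negative ...
b = target.copy()  # ... and the bias shifts it by the target, so a = target - position

def ideal_policy(pos):
    a = W @ pos + b
    # Rescale to the action limit while preserving the direction towards the target.
    return a / (np.linalg.norm(a) + 1e-9) * action_limit

print(ideal_policy(np.array([0.1, 0.2])))  # a small step in the direction of the target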

Results

After running the algorithm for about 400 episodes in the MLP case and 700 episodes in the custom-policy case, with 1000 steps per episode, the agent didn't seem to have learned anything useful. During the test runs the average return didn't increase, and when I checked the behavior from three different starting positions, the agent always walks towards the (0, 1) corner of the area; even when it starts right next to the target position, it walks past it, heading for the (0, 1) corner. What I did notice is that the custom policy architecture resulted in a much smaller std. dev. of the test episode returns.

Question

I'd like to understand why the algorithm doesn't seem to learn anything for the given setup and what needs to be changed to make it converge. I suspect a problem with the implementation or with the choice of hyper-parameters, as I can't spot any conceptual problem with learning a policy in this setup. However, I couldn't pinpoint the source of the problem, so I'd be happy if someone could help.


Average test return (custom policy architecture): [plot; vertical bars indicate std. dev. of the test episode returns]

Average test return (MLP policy architecture): [plot]

Test cases (custom policy architecture): [plot]

Test cases (MLP policy architecture): [plot]

Code

import logging
import os
import gym
from gym.wrappers.time_limit import TimeLimit
import numpy as np
from spinup.algos.ddpg.ddpg import core, ddpg
import tensorflow as tf


class TestEnv(gym.Env):
    target = np.array([0.7, 0.8])
    action_limit = 0.01
    observation_space = gym.spaces.Box(low=np.zeros(2), high=np.ones(2), dtype=np.float32)
    action_space = gym.spaces.Box(-action_limit * np.ones(2), action_limit * np.ones(2), dtype=np.float32)

    def __init__(self):
        super().__init__()
        self.pos = np.empty(2, dtype=np.float32)
        self.reset()

    def step(self, action):
        self.pos += action
        self.pos = np.clip(self.pos, self.observation_space.low, self.observation_space.high)
        reward_ctrl = -np.square(action).sum() / self.action_limit**2
        reward_dist = -np.linalg.norm(self.pos - self.target)
        reward = reward_ctrl + reward_dist
        done = abs(reward_dist) < 1e-9
        logging.debug('Observation: %s', self.pos)
        logging.debug('Reward: %.6f (reward (ctrl): %.6f, reward (dist): %.6f)', reward, reward_ctrl, reward_dist)
        return self.pos, reward, done, {}

    def reset(self):
        self.pos[:] = np.random.uniform(self.observation_space.low, self.observation_space.high, size=2)
        logging.info(f'[Reset] New position: {self.pos}')
        return self.pos

    def render(self, *args, **kwargs):
        pass


def mlp_actor_critic(x, a, hidden_sizes, activation=tf.nn.relu, action_space=None):
    act_dim = a.shape.as_list()[-1]
    act_limit = action_space.high[0]
    with tf.variable_scope('pi'):
        # pi = core.mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation=None)  # The standard way.
        pi = tf.layers.dense(x, act_dim, use_bias=True)  # Target position should be learned via the bias term.
        pi = pi / (tf.norm(pi) + 1e-9) * act_limit  # Prevent division by zero.
    with tf.variable_scope('q'):
        q = tf.squeeze(core.mlp(tf.concat([x,a], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
    with tf.variable_scope('q', reuse=True):
        q_pi = tf.squeeze(core.mlp(tf.concat([x,pi], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, q, q_pi


if __name__ == '__main__':
    log_dir = 'spinup-ddpg'
    if not os.path.exists(log_dir):
        os.mkdir(log_dir)
    logging.basicConfig(level=logging.INFO)
    ep_length = 1000
    ddpg(
        lambda: TimeLimit(TestEnv(), ep_length),
        mlp_actor_critic,
        ac_kwargs=dict(hidden_sizes=(64, 64, 64)),
        steps_per_epoch=ep_length,
        epochs=1_000,
        replay_size=1_000_000,
        start_steps=10_000,
        act_noise=TestEnv.action_limit/2,
        gamma=0.99,  # Use a large gamma; because of the action limit, it matters where we walk early in the episode.
        polyak=0.995,
        max_ep_len=ep_length,
        save_freq=10,
        logger_kwargs=dict(output_dir=log_dir)
    )

Upvotes: 4

Views: 1694

Answers (1)

Simon

Reputation: 5422

You are using a HUGE network (64x64x64) for a very small problem. That alone can be a big issue. You are also keeping 1M samples in memory and, again, for a very simple problem this may be detrimental and slow down convergence. Try a much simpler setup first (a 32x32 net and a 100,000-sample memory, or even a linear approximator with polynomial features). Also, how are you updating your target network? What is polyak? Finally, normalizing the action like that may not be a good idea. Better to just clip it or use a tanh layer at the end.
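
For example, a smaller actor-critic with a tanh-squashed policy head could look roughly like this (an untested sketch based on your mlp_actor_critic; the name small_actor_critic is just for illustration, and if I remember correctly this is close to what spinup's default actor does):

import tensorflow as tf
from spinup.algos.ddpg.ddpg import core

def small_actor_critic(x, a, hidden_sizes=(32, 32), activation=tf.nn.relu, action_space=None):
    act_dim = a.shape.as_list()[-1]
    act_limit = action_space.high[0]
    with tf.variable_scope('pi'):
        # tanh bounds the output to [-1, 1]; scaling by act_limit keeps it inside the action space.
        pi = act_limit * core.mlp(x, list(hidden_sizes) + [act_dim], activation, output_activation=tf.tanh)
    with tf.variable_scope('q'):
        q = tf.squeeze(core.mlp(tf.concat([x, a], axis=-1), list(hidden_sizes) + [1], activation, None), axis=1)
    with tf.variable_scope('q', reuse=True):
        q_pi = tf.squeeze(core.mlp(tf.concat([x, pi], axis=-1), list(hidden_sizes) + [1], activation, None), axis=1)
    return pi, q, q_pi

Combined with, say, ac_kwargs=dict(hidden_sizes=(32, 32)) and replay_size=100_000 in the ddpg call, this should be a more reasonable starting point for such a small problem.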

Upvotes: 1
