HWerneck

Reputation: 99

TF-Agents error: TypeError: The two structures do not match: Trajectory vs. Trajectory

I am building a PPO agent side by side with the TF-Agents DQN tutorial. The idea was to check the basic structures needed for a simple TF-Agents agent to work, and then adapt them to a PPO agent.

I am also using a custom environment based on ViZDoom. It installs and runs properly.

I am getting an error when testing the "collect_data" function. This is the code I am running, followed by the error I get (the full code is at the bottom):

data_doom_env = DoomEnvironment()
data_env = tf_py_environment.TFPyEnvironment(data_doom_env)

random_data_policy = random_tf_policy.RandomTFPolicy(data_env.time_step_spec(), data_env.action_spec())

collect_data(data_env, random_data_policy, replay_buffer, initial_collect_steps)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-d4548f4adc88> in <module>()
      5 random_data_policy = random_tf_policy.RandomTFPolicy(data_env.time_step_spec(), data_env.action_spec())
      6 
----> 7 collect_data(data_env, random_data_policy, replay_buffer, initial_collect_steps)

4 frames
/usr/local/lib/python3.7/dist-packages/tf_agents/utils/nest_utils.py in assert_same_structure(nest1, nest2, check_types, expand_composites, allow_shallow_nest1, message)
    124         lambda _: _DOT, nest2, expand_composites=expand_composites)
    125     raise exception('{}:\n  {}\nvs.\n  {}\nValues:\n  {}\nvs.\n  {}.'
--> 126                     .format(message, str1, str2, nest1, nest2))
    127 
    128 

TypeError: The two structures do not match:
  Trajectory(
{'action': .,
 'discount': .,
 'next_step_type': .,
 'observation': .,
 'policy_info': (),
 'reward': .,
 'step_type': .})
vs.
  Trajectory(
{'action': .,
 'discount': .,
 'next_step_type': .,
 'observation': .,
 'policy_info': DictWrapper({'dist_params': DictWrapper({'logits': .})}),
 'reward': .,
 'step_type': .})
Values:
  Trajectory(
{'action': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>,
 'discount': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>,
 'next_step_type': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>,
 'observation': <tf.Tensor: shape=(1, 1, 160, 260, 3), dtype=float32, numpy=
array([[[[[0.21568628, 0.21568628, 0.21568628],
          [0.2627451 , 0.2627451 , 0.2627451 ],
          [0.29411766, 0.29411766, 0.29411766],
          ...,
          [0.29411766, 0.29411766, 0.29411766],
          [0.2627451 , 0.2627451 , 0.2627451 ],
          [0.15294118, 0.15294118, 0.15294118]],

         [[0.10588235, 0.10588235, 0.10588235],
          [0.15294118, 0.15294118, 0.15294118],
          [0.21568628, 0.21568628, 0.21568628],
          ...,
          [0.15294118, 0.15294118, 0.15294118],
          [0.10588235, 0.10588235, 0.10588235],
          [0.10588235, 0.10588235, 0.10588235]],

         [[0.15294118, 0.15294118, 0.15294118],
          [0.13725491, 0.13725491, 0.13725491],
          [0.13725491, 0.13725491, 0.13725491],
          ...,
          [0.15294118, 0.15294118, 0.15294118],
          [0.15294118, 0.15294118, 0.15294118],
          [0.15294118, 0.15294118, 0.15294118]],

         ...,

         [[0.43529412, 0.34117648, 0.2627451 ],
          [0.43529412, 0.34117648, 0.2627451 ],
          [0.43529412, 0.34117648, 0.2627451 ],
          ...,
          [0.43529412, 0.34117648, 0.2627451 ],
          [0.46666667, 0.37254903, 0.29411766],
          [0.46666667, 0.37254903, 0.29411766]],

         [[0.48235294, 0.3882353 , 0.30980393],
          [0.48235294, 0.3882353 , 0.30980393],
          [0.5137255 , 0.41960785, 0.34117648],
          ...,
          [0.46666667, 0.37254903, 0.29411766],
          [0.46666667, 0.37254903, 0.29411766],
          [0.46666667, 0.37254903, 0.29411766]],

         [[0.40392157, 0.3254902 , 0.24705882],
          [0.43529412, 0.34117648, 0.2627451 ],
          [0.43529412, 0.34117648, 0.2627451 ],
          ...,
          [0.43529412, 0.34117648, 0.2627451 ],
          [0.43529412, 0.34117648, 0.2627451 ],
          [0.43529412, 0.34117648, 0.2627451 ]]]]], dtype=float32)>,
 'policy_info': (),
 'reward': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([-1.], dtype=float32)>,
 'step_type': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([0], dtype=int32)>})
vs.
  Trajectory(
{'action': BoundedTensorSpec(shape=(), dtype=tf.int32, name='selected_action', minimum=array(0, dtype=int32), maximum=array(4, dtype=int32)),
 'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'next_step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type'),
 'observation': BoundedTensorSpec(shape=(1, 160, 260, 3), dtype=tf.float32, name='screen_observation', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'policy_info': {'dist_params': {'logits': TensorSpec(shape=(5,), dtype=tf.float32, name='CategoricalProjectionNetwork_logits')}},
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')}).

I already tried what David Braun suggested in this Stack Overflow question. It did not work (code and error below), and I don't really get why he needed that change to make his code work while the official TF-Agents tutorial didn't. Anyway, the code:

#Testing the data collection function with a random policy and with David Braun's suggestion
data_doom_env = DoomEnvironment()
data_env = tf_py_environment.TFPyEnvironment(data_doom_env)

random_data_policy = random_tf_policy.RandomTFPolicy(data_env.time_step_spec(), data_env.action_spec())

collect_data(data_env, random_data_policy, replay_buffer, initial_collect_steps)

and the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-d4548f4adc88> in <module>()
      5 random_data_policy = random_tf_policy.RandomTFPolicy(data_env.time_step_spec(), data_env.action_spec())
      6 
----> 7 collect_data(data_env, random_data_policy, replay_buffer, initial_collect_steps)

4 frames
/usr/local/lib/python3.7/dist-packages/tf_agents/utils/nest_utils.py in assert_same_structure(nest1, nest2, check_types, expand_composites, allow_shallow_nest1, message)
    124         lambda _: _DOT, nest2, expand_composites=expand_composites)
    125     raise exception('{}:\n  {}\nvs.\n  {}\nValues:\n  {}\nvs.\n  {}.'
--> 126                     .format(message, str1, str2, nest1, nest2))
    127 
    128 

TypeError: The two structures do not match:
  Trajectory(
{'action': .,
 'discount': .,
 'next_step_type': .,
 'observation': .,
 'policy_info': (),
 'reward': .,
 'step_type': .})
vs.
  Trajectory(
{'action': .,
 'discount': .,
 'next_step_type': .,
 'observation': .,
 'policy_info': DictWrapper({'dist_params': DictWrapper({'logits': .})}),
 'reward': .,
 'step_type': .})
Values:
  Trajectory(
{'action': <tf.Tensor: shape=(1, 1), dtype=int32, numpy=array([[0]], dtype=int32)>,
 'discount': <tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[1.]], dtype=float32)>,
 'next_step_type': <tf.Tensor: shape=(1, 1), dtype=int32, numpy=array([[1]], dtype=int32)>,
 'observation': <tf.Tensor: shape=(1, 1, 1, 160, 260, 3), dtype=float32, numpy=
array([[[[[[0.2627451 , 0.2627451 , 0.2627451 ],
           [0.29411766, 0.29411766, 0.29411766],
           [0.2627451 , 0.2627451 , 0.2627451 ],
           ...,
           [0.2627451 , 0.2627451 , 0.2627451 ],
           [0.21568628, 0.21568628, 0.21568628],
           [0.15294118, 0.15294118, 0.15294118]],

          [[0.10588235, 0.10588235, 0.10588235],
           [0.21568628, 0.21568628, 0.21568628],
           [0.2627451 , 0.2627451 , 0.2627451 ],
           ...,
           [0.15294118, 0.15294118, 0.15294118],
           [0.10588235, 0.10588235, 0.10588235],
           [0.10588235, 0.10588235, 0.10588235]],

          [[0.13725491, 0.13725491, 0.13725491],
           [0.13725491, 0.13725491, 0.13725491],
           [0.15294118, 0.15294118, 0.15294118],
           ...,
           [0.15294118, 0.15294118, 0.15294118],
           [0.15294118, 0.15294118, 0.15294118],
           [0.15294118, 0.15294118, 0.15294118]],

          ...,

          [[0.43529412, 0.34117648, 0.2627451 ],
           [0.43529412, 0.34117648, 0.2627451 ],
           [0.43529412, 0.34117648, 0.2627451 ],
           ...,
           [0.46666667, 0.37254903, 0.29411766],
           [0.46666667, 0.37254903, 0.29411766],
           [0.46666667, 0.37254903, 0.29411766]],

          [[0.48235294, 0.3882353 , 0.30980393],
           [0.5137255 , 0.41960785, 0.34117648],
           [0.5137255 , 0.41960785, 0.34117648],
           ...,
           [0.46666667, 0.37254903, 0.29411766],
           [0.46666667, 0.37254903, 0.29411766],
           [0.46666667, 0.37254903, 0.29411766]],

          [[0.43529412, 0.34117648, 0.2627451 ],
           [0.37254903, 0.29411766, 0.21568628],
           [0.37254903, 0.29411766, 0.21568628],
           ...,
           [0.43529412, 0.34117648, 0.2627451 ],
           [0.43529412, 0.34117648, 0.2627451 ],
           [0.43529412, 0.34117648, 0.2627451 ]]]]]], dtype=float32)>,
 'policy_info': (),
 'reward': <tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[-1.]], dtype=float32)>,
 'step_type': <tf.Tensor: shape=(1, 1), dtype=int32, numpy=array([[0]], dtype=int32)>})
vs.
  Trajectory(
{'action': BoundedTensorSpec(shape=(), dtype=tf.int32, name='selected_action', minimum=array(0, dtype=int32), maximum=array(4, dtype=int32)),
 'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'next_step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type'),
 'observation': BoundedTensorSpec(shape=(1, 160, 260, 3), dtype=tf.float32, name='screen_observation', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'policy_info': {'dist_params': {'logits': TensorSpec(shape=(5,), dtype=tf.float32, name='CategoricalProjectionNetwork_logits')}},
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')}).

I noticed the observation shape is different in the two Trajectories. It should be shape=(1, 160, 260, 3), but one of the observations has shape=(1, 1, 160, 260, 3). I am not sure why that is or where I should try to fix it, as I thought that was exactly what David Braun's answer was about.
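My best guess (not verified) is that the extra leading 1 is just the batch dimension that TFPyEnvironment adds, while the spec stored in collect_data_spec is per-element, without the batch dimension. Something like this should show it:

print(data_env.batch_size)                          # 1
print(data_env.time_step_spec().observation.shape)  # (1, 160, 260, 3), no batch dimension
print(data_env.reset().observation.shape)           # (1, 1, 160, 260, 3), batch dimension added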

I don't really know how to proceed or what else to try; I am really stuck. Does anybody have any idea why the trajectories have different structures and how to fix it?
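For reference, a minimal way to compare what the replay buffer expects with what the random policy actually produces (just a sketch; I believe both objects expose these spec properties):

print(agent.collect_data_spec.policy_info)   # PPO expects {'dist_params': {'logits': ...}}
print(random_data_policy.info_spec)          # RandomTFPolicy emits () here
print(random_data_policy.trajectory_spec)    # full Trajectory spec the policy generates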

Full code:

from google.colab import drive
drive.mount('/content/drive')
#%%bash
# Install deps from 
# https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md#-linux
!sudo apt update
!sudo apt upgrade

!sudo apt install build-essential zlib1g-dev libsdl2-dev libjpeg-dev nasm tar libbz2-dev libgtk2.0-dev \
cmake git libfluidsynth-dev libgme-dev libopenal-dev timidity libwildmidi-dev unzip

# Boost libraries
!sudo apt install libboost-all-dev

# Lua binding dependencies
!sudo apt install liblua5.1-dev

!pip install tf-agents

!pip install git+https://github.com/mwydmuch/ViZDoom
#!pip uninstall vizdoom
#!pip install vizdoom

!sudo apt update
!sudo apt upgrade
from vizdoom import *
import numpy as np
import pandas as pd
import seaborn as sbrn

from __future__ import absolute_import, division, print_function

import tensorflow as tf
from tensorflow import keras

from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment

from tf_agents.specs import array_spec, BoundedArraySpec, ArraySpec
from tf_agents.networks.actor_distribution_rnn_network import ActorDistributionRnnNetwork
from tf_agents.networks.value_rnn_network import ValueRnnNetwork

from tf_agents.trajectories import trajectory, time_step
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer

import time
import random

class DoomEnvironment(py_environment.PyEnvironment):

  def __init__(self):
    super().__init__()

    self._game = self.create_environment()
    self._state = self._game.get_state()
    self._num_actions = self._game.get_available_buttons_size()

    self._screen_type = 0
    self._frame_stack_size = 1
    self._stacked_frames = np.zeros((self._frame_stack_size, 160, 260, 3), dtype=np.float32)
    self._doom_vars = [0] * 9

    self._action_spec = array_spec.BoundedArraySpec(shape=(), dtype=np.int32, minimum=0, maximum=self._num_actions - 1, name='selected_action')

    #1. Single image observation
    self._observation_spec = array_spec.BoundedArraySpec(shape=(self._frame_stack_size, 160, 260, 3), dtype=np.float32, minimum=0, maximum=1, name='screen_observation')

  def create_environment(self):
    ### New game instance
    game = DoomGame()
    
    ### Adjust configuration file path
    game.load_config('/content/drive/My Drive/ViZDoom/Config/Test_configuration.cfg')

    ### Adjust game scenario path
    game.set_doom_scenario_path('/content/drive/My Drive/ViZDoom/Maps/basic.wad')

    ### Google Colab does not support the video output of ViZDoom. The following line is needed for the environment to be run.
    game.set_window_visible(False)

    ### Adding relevant variables (reward-related variables)
    game.add_available_game_variable(GameVariable.USER1)
    game.add_available_game_variable(GameVariable.USER2)
    game.add_available_game_variable(GameVariable.USER3)
    game.add_available_game_variable(GameVariable.USER4)
    game.add_available_game_variable(GameVariable.USER5)
    game.add_available_game_variable(GameVariable.USER6)
    game.add_available_game_variable(GameVariable.USER7)
    game.add_available_game_variable(GameVariable.USER8)
    game.add_available_game_variable(GameVariable.USER9)
    game.add_available_game_variable(GameVariable.USER10)

    #Instantiate the game
    game.init()

    #return game, possible_actions
    return game

  def _reset(self):
    self._game.new_episode()
    self._stacked_frames = np.zeros((self._frame_stack_size, 160, 260, 3), dtype=np.float32)
    self._doom_vars = [0] * 9
    t_step = time_step.restart(self._create_observation())
    return t_step

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _create_observation(self):
    screen = self.stack_frames()
    observation_spec = screen
    return observation_spec
    
  def _step(self, selected_action):
    if (self._game.is_episode_finished()):
      return self.reset()
    
    repeating_tics = 1
    
    #Creates a vector with all possible actions set to 0, then sets the selected action to 1.
    action = [0] * self._num_actions
    action[selected_action] = 1
    
    #Makes action, receives reward.
    reward = self._game.make_action(action, repeating_tics)
    
    if (self._game.is_episode_finished()):
      return time_step.termination(self._create_observation(), reward)
    else:
      return time_step.transition(self._create_observation(), reward)

  def stack_frames(self):
    new_frame = self.preprocess_frame()

    if self._game.is_new_episode():
      for frame in range(self._frame_stack_size):
        self._stacked_frames[frame] = new_frame
    else:
      for frame in range((self._frame_stack_size) - 1):
        self._stacked_frames[frame] = self._stacked_frames[frame + 1]
      self._stacked_frames[self._frame_stack_size - 1] = new_frame
    return self._stacked_frames

  def preprocess_frame(self):
    """
      Preprocess frame before stacking it:
        - Region-of-interest (ROI) selected from the original frame.
          Frame is cropped by 40 pixels at the top and bottom, and 30 pixels on the left and right.
        - Normalize images to interval [0,1]
    """
    frame = self.get_screen_buffer_frame()
    if (self._screen_type == 0):
      frame = self.get_screen_buffer_frame()
    elif (self._screen_type == 1):
      frame = self.get_label_buffer_frame()

    roi = frame[40:-40, 30:-30]
    roi_normal = np.divide(roi, 255, dtype=np.float32)
    return roi_normal

  def get_screen_buffer_frame(self):
    """ Get the current game screen buffer or an empty screen buffer if episode is finished"""
    if (self._game.is_episode_finished()):
        return np.zeros((240, 320, 3), dtype=np.float32)
    else:
        return self._game.get_state().screen_buffer

  def get_label_buffer_frame(self):
    """ Get the current game label screen buffer or an empty screen buffer if episode is finished"""
    if (self._game.is_episode_finished()):
        return np.zeros((240, 320, 3), dtype=np.float32)
    else:
        return self._game.get_state().labels_buffer

  def render(self, mode='rgb_array'):
    """ Return game frame for rendering. """
    return self.get_screen_buffer_frame()

initial_collect_steps = 1000 # @param {type:"integer"}
max_steps_per_episode = 100 # @param {type:"integer"} #4200
number_of_episodes = 500 # @param {type:"integer"} #10000
number_of_epochs = 3 # @param {type:"integer"} #10

batch_size = 1 # @param {type:"integer"}
replay_buffer_max_size = 10000 # @param {type: "integer"}

learning_rate = 5e-4 # @param {type:"number"}
epsilon = 1e-5 # @param {type:"number"}
discount_factor = 0.995 # @param {type:"number"}

def create_networks(observation_spec, action_spec):
  actor_net = ActorDistributionRnnNetwork(
      observation_spec, action_spec,
      conv_layer_params = [(16, 8, 4), (32, 4, 2)],
      input_fc_layer_params = (256,),
      lstm_size = (256,),
      output_fc_layer_params = (128,),
      activation_fn = tf.nn.elu)
  value_net = ValueRnnNetwork(
      observation_spec,
      conv_layer_params = [(16, 8, 4), (32, 4, 2)],
      input_fc_layer_params = (256,),
      lstm_size = (256,),
      output_fc_layer_params = (128,),
      activation_fn = tf.nn.elu)
  
  return actor_net, value_net

def compute_average_return(environment, policy, number_of_episodes=10):
  total_return = 0.0
  for episode in range(number_of_episodes):
    time_step = environment.reset()
    episode_return = 0.0
    step = 0

    while not (time_step.is_last() or step >= 100):
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward #episode_return += time_step.reward
      step += 1
    total_return += episode_return

  average_return = total_return/number_of_episodes
  return average_return.numpy()[0]

##This is the bit most similar to the one producing the error. This part works OK and prints out different values every time I run it.
metrics_doom_env = DoomEnvironment()
metrics_env = tf_py_environment.TFPyEnvironment(metrics_doom_env)

random_metrics_policy = random_tf_policy.RandomTFPolicy(metrics_env.time_step_spec(), metrics_env.action_spec())

compute_average_return(metrics_env, random_metrics_policy, 10)

env = DoomEnvironment()
tf_env = tf_py_environment.TFPyEnvironment(env)
evaluation_tf_env = tf_py_environment.TFPyEnvironment(env)

actor_net, value_net = create_networks(tf_env.observation_spec(), tf_env.action_spec())

global_step = tf.compat.v1.train.get_or_create_global_step()
#optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate = learning_rate, epsilon = epsilon)

agent = ppo_agent.PPOAgent(
    time_step_spec = tf_env.time_step_spec(),
    action_spec = tf_env.action_spec(),
    actor_net = actor_net,
    value_net = value_net,
    optimizer = optimizer,
    num_epochs = number_of_epochs,
    train_step_counter = global_step,
    discount_factor = discount_factor,
    gradient_clipping = 0.5,
    entropy_regularization = 1e-2,
    importance_ratio_clipping = 0.2,
    use_gae = True,
    use_td_lambda_return = True)

agent.initialize()

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=replay_buffer_max_size)

agent.collect_data_spec
#This is not code; it is the output of the previous code block. I thought it might be helpful.
_TupleWrapper(Trajectory(
{'action': BoundedTensorSpec(shape=(), dtype=tf.int32, name='selected_action', minimum=array(0, dtype=int32), maximum=array(4, dtype=int32)),
 'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'next_step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type'),
 'observation': BoundedTensorSpec(shape=(1, 160, 260, 3), dtype=tf.float32, name='screen_observation', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'policy_info': {'dist_params': {'logits': TensorSpec(shape=(5,), dtype=tf.float32, name='CategoricalProjectionNetwork_logits')}},
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')}))

def collect_step(environment, policy, buffer):
  t_step = environment.current_time_step()
  action_step = policy.action(t_step)
  next_t_step = environment.step(action_step.action)

  trajectory_input = trajectory.from_transition(t_step, action_step, next_t_step)
  buffer.add_batch(trajectory_input)
  return

def collect_data(environment, policy, buffer, steps):
  for step in range(steps):
    collect_step(environment, policy, buffer)

data_doom_env = DoomEnvironment()
data_env = tf_py_environment.TFPyEnvironment(data_doom_env)

random_data_policy = random_tf_policy.RandomTFPolicy(data_env.time_step_spec(), data_env.action_spec())

collect_data(data_env, random_data_policy, replay_buffer, initial_collect_steps)

I do not have any further code (the tutorial goes on), since my application breaks here.

Upvotes: 0

Views: 665

Answers (1)

David Braun

Reputation: 820

I think the RandomTFPolicy is returning a Trajectory without the 'policy_info': DictWrapper({'dist_params': DictWrapper({'logits': .})}) entry.

Maybe you should initialize it with emit_log_probability=True:

random_data_policy = random_tf_policy.RandomTFPolicy(
    data_env.time_step_spec(), data_env.action_spec(), emit_log_probability=True)

Based on this section, I have some hope it will output the policy info. Otherwise, you might need to dive into RandomTFPolicy more.
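Either way, a quick check (just a sketch, untested) is to compare what the policy will emit with what the replay buffer was built for:

print(random_data_policy.info_spec)          # info the random policy will emit
print(agent.collect_data_spec.policy_info)   # info the PPO replay buffer expects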

Upvotes: 1
