Reputation: 53
I am implementing an RL agent based on A2C of stable-baseline3 on a gym environment with MultiDiscrete observation and action spaces.
I get the following error when learning
RuntimeError: Class values must be smaller than num_classes.
This is a typical PyTorch error, but I do not get its origin. I attach my code.
Before the code, I explain the idea. We train a Custom environment where we have several machines (we train first only two machines), needing to decide the production rate of the machines before they break. The action space includes also the decision of scheduling the maintenance in some time distance, and for each machine it decides which machine to be maintained.
Hence, observation space is the consumption state of each machine and the time distance of the scheduled maintenance (it can also be "not scheduled"), whereas action space is production rate for each machine, maintenance decision for each machine and call-to-schedule.
The reward is given when the total production exceeds a threshold, and negative rewards are the costs of maintenance and scheduling.
Now, I know this is a big thing and we need to reduce these spaces, but the actual problem is this error with PyTorch. I do not see where it comes from. A2C deals with both MultiDiscrete space in observation and action, but I do not know the origin of this. We set a A2C algorithm with MlpPolicy and we try to train the policy with this environment.
I attach the code.
from gym import Env
from gym.spaces import MultiDiscrete
import numpy as np
from numpy.random import poisson
import random
from functools import reduce
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Flatten
# from tensorflow.keras.optimizers import Adam
from stable_baselines3 import A2C
from stable_baselines3.common.env_checker import check_env
class MaintenanceEnv(Env):
def __init__(self, max_machine_states_vec, production_rates_vec, production_threshold, scheduling_horizon, operations_horizon = 100):
"""
Returns:
self.action_space is a vector with the maximum production rate fro each machine, a binary call-to-maintenance and a binary call-to-schedule
"""
num_machines = len(max_machine_states_vec)
assert len(max_machine_states_vec) == len(production_rates_vec), "Machine states and production rates have different cardinality"
# Actions we can take, down, stay, up
self.action_space = MultiDiscrete(production_rates_vec + num_machines*[2] + [2]) ### Action space is the production rate from 0 to N and the choice of scheduling
# Temperature array
self.observation_space = MultiDiscrete(max_machine_states_vec + [scheduling_horizon+2]) ### Observation space is the 0,...,L for each machine + the scheduling state including "ns" (None = "ns")
# Set start temp
self.state = num_machines*[0] + [0]
# Set shower length
self.operations_horizon = operations_horizon
self.time_to_finish = operations_horizon
self.scheduling_horizon = scheduling_horizon
self.max_states = max_machine_states_vec
self.production_threshold = production_threshold
def step(self, action):
"""
Notes: Schedule state
"""
num_machines = len(self.max_states)
maintenance_distance_index = -1
reward = 0
done = False
info = {}
### Cost parameters
cost_setup_schedule = 5
cost_preventive_maintenance = 10
cost_corrective_maintenance = 50
reward_excess_on_production = 5
cost_production_deficit = 10
cost_fixed_penalty = 10
failure_reward = -10**6
amount_produced = 0
### Errors
if action[maintenance_distance_index] == 1 and self.state[-1] != self.scheduling_horizon + 1: # Case when you set a reparation scheduled, but it is already scheduled. Not possible.
reward = failure_reward ###It should not be possible
done = True
return self.state, reward, done, info
if self.state[-1] == 0:
for pos in range(num_machines):
if action[num_machines + pos] == 1 and self.state[maintenance_distance_index] > 0: ### Case when maintenance is applied, but schedule is not involved yet. Not possible.
reward = failure_reward ### It should not be possible
done = True
return self.state, reward, done, info
for pos in range(num_machines):
if self.state[pos] == self.max_states[pos] and action[pos] > 0: # Case when machine is broken, but it is producing
reward = failure_reward ### It should not be possible
done = True
return self.state, reward, done, info
if self.state[maintenance_distance_index] == 0:
for pos in range(num_machines):
if action[num_machines+pos] == 1 and action[pos] > 0 : ### Case when it is maintenance time but the machines to be maintained keeps working. Not possible
reward = failure_reward ### It should not be possible
done = True
return self.state, reward, done, info
### State update
for pos in range(num_machines):
if self.state[pos] < self.max_states[pos] and self.state[maintenance_distance_index] > 0: ### The machine is in production, state update includes product amount
# self.state[pos] = min(self.max_states[pos] , self.state[pos] + poisson(action[pos] / self.action_space[pos])) ### Temporary: for I delete from the state the result of a poisson distribution depending on the production rate, Poisson is temporary
self.state[pos] = min(self.max_states[pos] , self.state[pos] + action[pos]) ### Temporary: Consumption rate is deterministic
amount_produced += action[pos]
if amount_produced >= self.production_threshold:
reward += reward_excess_on_production * (amount_produced - self.production_threshold)
else:
reward -= cost_production_deficit * (self.production_threshold - amount_produced)
reward -= cost_fixed_penalty
if action[maintenance_distance_index] == 1 and self.state[maintenance_distance_index] == self.scheduling_horizon + 1: ### You call a schedule when the state is not scheduled
self.state[maintenance_distance_index] = self.scheduling_horizon
reward -= cost_setup_schedule
elif self.state[maintenance_distance_index] > 0 and self.state[maintenance_distance_index] <= self.scheduling_horizon: ### You reduced the distance from scheduled maintenance
self.state[maintenance_distance_index] -= 1
for pos in range(num_machines): ### Case when we are repairing the machines and we need to pay the costs of repairment, and set them as new
if action[num_machines+pos] == 1 :
if self.state[pos] < self.max_states[pos]:
reward -= cost_preventive_maintenance
elif self.state[pos] == self.max_states[pos]:
reward -= cost_corrective_maintenance
self.state[pos] = 0
if self.state[maintenance_distance_index] == 0: ### when maintenance have been performed, reset the scheduling state to "not scheduled"
self.state[maintenance_distance_index] = self.scheduling_horizon + 1
### Time threshold
if self.time_to_finish > 0:
self.time_to_finish -= 1
else:
done = True
# Return step information
return self.state, reward, done, info
def render(self):
# Implement viz
pass
def reset(self):
# Reset shower temperature
num_machines = len(self.max_states)
self.state = np.array(num_machines*[0] + [0])
self.time_to_finish = self.operations_horizon
return self.state
def build_model(states, actions):
model = Sequential()
model.add(Dense(24, activation='relu', input_shape=states)) #
model.add(Dense(24, activation='relu'))
model.add(Dense(actions, activation='linear'))
return model
if __name__ == "__main__":
###GLOBAL COSTANTS AND PARAMETERS
NUMBER_MACHINES = 2
FAILURE_STATE_LIMIT = 8
MAXIMUM_PRODUCTION_RATE = 5
SCHEDULING_HORIZON = 4
PRODUCTION_THRESHOLD = 20
machine_states = NUMBER_MACHINES * [4]
failure_states = NUMBER_MACHINES * [FAILURE_STATE_LIMIT]
production_rates = NUMBER_MACHINES * [MAXIMUM_PRODUCTION_RATE]
### Setting environment
env = MaintenanceEnv(failure_states, production_rates, PRODUCTION_THRESHOLD, SCHEDULING_HORIZON)
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
action, _state = model.predict(obs, deterministic=True)
obs, reward, done, info = env.step(action)
# env.render()
if done:
obs = env.reset()
I have the feeling it is due to the MultiDiscrete spaces, but I ask help. Thanks :)
Upvotes: 2
Views: 2640
Reputation: 99
As per the docs https://www.gymlibrary.dev/api/spaces/#discrete, Discrete(n) would have values from 0 to (n-1) by default. We have to note that this does not go up to n.
The MultiDiscrete space is built on top of Discrete and has the same behavior. Hence, if an action leads to a value n, we would get this error. A simple way to solve it is to ensure that the values never exceed the bounds (inclusive of 0 and up to n-1) while taking a step.
Upvotes: 1