gulb

Reputation: 21

What kind of reward should I set in Q-learning to get values closer to the result I expect?

I'm working on a Q-learning project using OpenAI Gym and gym-pybullet-drones. My goal is to control the drone's height so that it reaches a height of 1 and stays stable there. I'm using the discrete actions 0, 1, and 2, which my training loop maps (via a -1 offset) to the motor command vectors [-1 -1 -1 -1], [0 0 0 0], and [1 1 1 1] respectively. Initially I tried setting the reward as `reward = (1 - next_state)**2`, but I noticed that the reward and the drone's altitude were inversely related: as the drone descended below the target, the reward increased. When I used only the environment's built-in reward, the drone settled at a height of 1.5 instead of 1.
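
To see why, note that `(1 - z)**2` is a squared error, i.e. a distance to the target, so used directly as a reward it pays the agent for moving *away* from z = 1; negating it (or using any function that peaks at z = 1) reverses that. A minimal check, assuming `next_state` is the scalar altitude `z`:

```python
# Assumed: z is the drone's altitude and the target height is 1
for z in [0.0, 0.5, 1.0, 1.5, 2.0]:
    squared_error = (1 - z) ** 2   # grows as the drone leaves z = 1
    shaped_reward = -(1 - z) ** 2  # maximal (0) exactly at z = 1
    print(f"z={z}: error={squared_error:.2f}, shaped reward={shaped_reward:.2f}")
```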

The environment's built-in reward function:

```python
def _computeReward(self):
    state = self._getDroneStateVector(0)
    # Peaks at 2 on TARGET_POS, decays with the 4th power of the distance
    ret = max(0, 2 - np.linalg.norm(self.TARGET_POS - state[0:3])**4)
    return ret
```
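
Note that this reward is bounded: it peaks at 2 when the drone sits exactly on `TARGET_POS` and drops to 0 once the fourth power of the distance exceeds 2. A quick sketch of the same shape outside the class, assuming the target is `[0, 0, 1]`:

```python
import numpy as np

TARGET_POS = np.array([0.0, 0.0, 1.0])  # assumed hover target

def compute_reward(pos):
    # Same shape as the environment's reward: 2 at the target,
    # decaying with the 4th power of the distance, floored at 0
    return max(0, 2 - np.linalg.norm(TARGET_POS - pos) ** 4)

for z in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"z={z}: reward={compute_reward(np.array([0.0, 0.0, z])):.4f}")
```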

[result plot](https://i.sstatic.net/AB4xE48J.png)

Here is my `get_action()` function:

```python
import random
import numpy as np

def get_action(q_values, epsilon):
    # Epsilon-greedy: exploit the current Q-estimates with probability 1 - epsilon
    if random.random() > epsilon:
        return np.argmax(q_values.numpy()[0])
    # Otherwise explore with a uniformly random action
    return random.choice(np.arange(3))
```
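
For reference, this kind of epsilon-greedy policy is usually paired with a decaying epsilon so exploration tapers off over training; a sketch with hypothetical constants (not taken from the question):

```python
# Hypothetical decay schedule; the values are assumptions for illustration
epsilon, EPSILON_MIN, EPSILON_DECAY = 1.0, 0.01, 0.995

for episode in range(1000):
    # ... run one episode, calling get_action(q_values, epsilon) each step ...
    epsilon = max(EPSILON_MIN, EPSILON_DECAY * epsilon)

print(f"epsilon after 1000 episodes: {epsilon:.3f}")
```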

And my training loop:

```python
for i in range(num_episodes):
    state = env.reset()
    state = state[0][0][2]   # keep only the z (altitude) component of the observation
    total_points = 0

    for t in range(max_num_timesteps):
        state_qn = np.expand_dims(state, axis=0)  # add a batch dimension
        q_values = q_network(state_qn)

        action = utils.get_action(q_values, epsilon)

        # Shift the discrete index 0/1/2 to the motor command vectors
        # [-1 -1 -1 -1] / [0 0 0 0] / [1 1 1 1]
        a = np.array([[-1, -1, -1, -1]])
        action = (action + a).reshape(1, -1)

        # Gymnasium-style step: (obs, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = next_state[0][2]  # again keep only the altitude
        #reward = (1 - next_state)**2

        memory_buffer.append(experience(state, action, reward, next_state, done))
        update = utils.check_update_conditions(t, NUM_STEPS_FOR_UPDATE, memory_buffer)

        if update:
            experiences = utils.get_experiences(memory_buffer)
            agent_learn(experiences, GAMMA)

        state = next_state
        total_points += reward

        if done:  # stop the episode once the environment terminates
            break
```
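
As a sanity check on the `action + a` shift in the loop above, each discrete index maps to a motor command vector as follows:

```python
import numpy as np

offset = np.array([[-1, -1, -1, -1]])
for idx in range(3):
    print(idx, "->", (idx + offset).reshape(1, -1))
# 0 -> [[-1 -1 -1 -1]], 1 -> [[0 0 0 0]], 2 -> [[1 1 1 1]]
```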

What kind of reward function can I define to correct this? Or what else can I do?

Upvotes: 1

Views: 102

Answers (0)
