Himansu Odedra

Reputation: 81

Multi-armed bandit: why do we increase the reward by 1 when the random probability is less than the probability of success assigned to the bandit?

I am trying to understand the multi-armed bandit problem using Python. I keep coming across pieces of code that return a reward of 1 when a random probability is less than the success probability of the assigned bandit. Please see the code below:

import random

def reward(prob):
    # pull the arm 10 times; each pull succeeds with probability `prob`
    reward = 0
    for i in range(10):
        if random.random() < prob:
            reward += 1
    return reward

I got this from the following link: http://outlace.com/rlpart1.html

I have also seen something similar on another GitHub page. Based on the first link, what is the intuition behind the reward function (how is it similar to that of an actual one-armed bandit), and why do we assign a reward of 1 when the random value is less than the probability? Surely it is supposed to be the opposite, unless I am mistaken. Thank you.

Upvotes: 0

Views: 365

Answers (2)

Areza

Reputation: 6080

  1. The probability is just a switch between exploration and exploitation, meaning you can set how often you would like to explore and how often to exploit. The implementation is one of the simplest algorithms (epsilon-greedy); in more advanced versions you can change this ratio dynamically or use other algorithms.

  2. Whether the comparison is "less than" or "greater than" doesn't matter! What I mean is that the math is the same; you can implement it one way or the other, similar to point 1.

  3. Again, the fact that the reward is 1 is arbitrary and one of the simplest choices. It is convenient because you can later add up the rewards you have received: e.g. in the marketing case, if you ran 100000 ad campaigns, you could easily calculate the success rate. In more advanced versions, the reward can be a function and more complicated; in the same marketing campaign you could embed the price and cost into the reward, so it would not simply be 1 but some continuous value.
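A minimal sketch of point 3 (my own illustration, not code from the linked tutorial): if each pull pays a reward of 1 with some fixed probability, the total reward divided by the number of pulls recovers the arm's success rate. The `true_prob` and `n_trials` values below are hypothetical.

```python
import random

def pull(prob):
    # Bernoulli trial: reward 1 with probability `prob`, else 0
    return 1 if random.random() < prob else 0

random.seed(0)
true_prob = 0.3          # hypothetical success probability of one arm
n_trials = 100_000       # e.g. 100,000 ad impressions
total_reward = sum(pull(true_prob) for _ in range(n_trials))
success_rate = total_reward / n_trials
# success_rate will be close to true_prob
```

This is why a 0/1 reward is a convenient default: totals and averages are directly interpretable as counts and rates.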

Upvotes: 1

StayLearning

Reputation: 691

This reward function would not exist if you had actual data on which arm was selected and its success label.

My understanding is that you are doing this because you do not have actual response data. In other words, when you show an arm, you do not know whether it led to a success (1) or not (0).

So you just assume that if the prob is, say, 0.7, there is a 70% chance you will get a 1, like a Bernoulli variable with success probability 0.7. The `random.random()` call is just how you implement this. The larger the prob (the success probability of an arm), the larger the chance that you get a reward.
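To make this concrete (a sketch of my own, using the question's `reward` function): since each of the 10 inner trials is a Bernoulli draw with success probability `prob`, the expected reward of one call is `10 * prob`. Averaging many calls shows that an arm with a higher `prob` really does yield a higher reward on average.

```python
import random

def reward(prob):
    # the question's reward function: 10 Bernoulli(prob) trials per pull
    r = 0
    for _ in range(10):
        if random.random() < prob:
            r += 1
    return r

random.seed(1)
# average reward over many pulls approaches 10 * prob
avg = sum(reward(0.7) for _ in range(20_000)) / 20_000
# avg will be close to 7.0 for prob = 0.7
```

So `random.random() < prob` is simply how you simulate a success that the real world would otherwise hand you as data.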

Upvotes: 1
