Reputation: 112
I am using Vowpal Wabbit's contextual bandit to rank various actions given a context.
Train Data:
"1:10:0.1 | 123"
"2:9:0.1 | 123"
"3:8:0.1 | 123"
"4:7:0.1 | 123"
"5:6:0.1 | 123"
"6:5:0.1 | 123"
"7:4:0.1 | 123"
Test Data:
" | 123"
Now, the expected ranking of actions should be (from least loss to most loss):
7 6 5 4 3 2 1
Using --cb
just returns the single predicted-best action:
7
And using --cb_explore
returns a probability distribution (PMF) over the actions to be explored, but it doesn't seem to help with ranking.
[0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.0071428571827709675, 0.9571428298950195]
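For reference, the behaviour above can be reproduced roughly like this (a minimal sketch assuming the Python vowpalwabbit bindings; the equivalent CLI flags are --cb 7 and --cb_explore 7):

from vowpalwabbit import pyvw  # pip install vowpalwabbit

train = [
    "1:10:0.1 | 123",
    "2:9:0.1 | 123",
    "3:8:0.1 | 123",
    "4:7:0.1 | 123",
    "5:6:0.1 | 123",
    "6:5:0.1 | 123",
    "7:4:0.1 | 123",
]

# --cb: plain contextual bandit; the prediction is the single best action
cb = pyvw.vw("--cb 7 --quiet")
for line in train:
    cb.learn(line)
print(cb.predict(" | 123"))          # -> 7
cb.finish()

# --cb_explore: epsilon-greedy on top of --cb; the prediction is a PMF over actions
cb_explore = pyvw.vw("--cb_explore 7 --quiet")
for line in train:
    cb_explore.learn(line)
print(cb_explore.predict(" | 123"))  # -> PMF as in the question, e.g. [0.00714..., ..., 0.95714...]
cb_explore.finish()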
Is there any other way of using vw's contextual bandit for ranking?
Upvotes: 2
Views: 1168
Reputation: 519
I wouldn't use the PMF to rank the actions, since the PMF does not correspond to the expected reward of each action given the context (unlike in the traditional multi-armed bandit setting, e.g. with Thompson Sampling, where it does).
A good way of doing what you want is to sample multiple actions (without replacement) from the action set, which is what the CCB submodule does (Jack's answer).
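As a rough illustration of the idea (just a sketch with numpy, using the PMF from the question; CCB itself re-runs the policy per slot rather than sampling like this):

import numpy as np

# PMF returned by --cb_explore for the test context (from the question)
pmf = np.array([0.00714286] * 6 + [0.95714283])
actions = np.arange(1, 8)  # actions 1..7

rng = np.random.default_rng(0)
# Draw a full ordering by sampling without replacement, proportionally to the
# (renormalized) PMF; action 7 will almost always come first.
order = rng.choice(actions, size=len(actions), replace=False, p=pmf / pmf.sum())
print(order)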
I wrote a tutorial and code for how to implement this (with simulated rewards) that could be helpful for analyzing how the PMFs update and how the model does for your specified reward distribution and action set.
Upvotes: 1
Reputation: 821
Olga's response on the repo: https://github.com/VowpalWabbit/vowpal_wabbit/issues/2555
--cb does not do any exploration and just trains the model on the given input, so the output will be whatever the model (trained so far) predicts.
--cb_explore includes exploration using epsilon-greedy by default if nothing else is specified. You can take a look at all the available exploration methods here
cb_explore's output is the PMF given by the exploration strategy (see here for more info).
Epsilon-greedy will choose, with probability e, an action at random from a uniform distribution (exploration), and with probability 1-e epsilon-greedy will use the so-far trained model to predict the best action (exploitation).
So the output will be the PMF over the actions: each action gets an equal share e/K of the exploration probability, and the predicted-best action additionally gets the exploitation probability 1-e. Therefore cb_explore will not provide you with a ranking.
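To make the numbers concrete (a small sketch assuming VW's default epsilon of 0.05 and the 7 actions from the question):

epsilon, k = 0.05, 7
explore = epsilon / k                # 0.05 / 7 ≈ 0.0071428 -> each non-best action
exploit = 1 - epsilon + epsilon / k  # ≈ 0.9571428 -> the predicted-best action
print(explore, exploit)              # matches the PMF shown in the question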
One option for ranking would be to use CCB. Then you get a ranking and can provide feedback on any slot, but it is more computationally expensive. CCB runs CB for each slot, but the effect is a ranking since each slot draws from the overall pool of actions.
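A rough sketch of what CCB input looks like (assuming the Python bindings and the --ccb_explore_adf reduction; the slot label here is chosen_action:cost:probability with zero-based action indices, but the CCB wiki page is the authoritative reference for the format):

from vowpalwabbit import pyvw

vw = pyvw.vw("--ccb_explore_adf --quiet")

# Training: shared context, the pool of actions, and labelled slots
train = [
    "ccb shared | 123",
    "ccb action | action_1",
    "ccb action | action_2",
    "ccb action | action_3",
    "ccb slot 0:10:0.33 |",
    "ccb slot 2:4:0.33 |",
]
vw.learn(train)

# Prediction: unlabelled slots; each slot picks from the remaining actions,
# so the per-slot choices form a ranking of the action pool
test = [
    "ccb shared | 123",
    "ccb action | action_1",
    "ccb action | action_2",
    "ccb action | action_3",
    "ccb slot |",
    "ccb slot |",
    "ccb slot |",
]
print(vw.predict(test))
vw.finish()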
And my follow up:
I think CCB is a good option if computational limits allow. I'd just like to add that if you do cb_explore or cb_explore_adf then the resulting PMF should be sorted by score so it is a ranking of sorts. However, it's worth verifying that the ordering is in fact sorted by scores (--audit will help here) as I don't know if there is a test covering this.
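For instance, sorting actions by the returned probabilities (using the PMF from the question; note that under epsilon-greedy all non-best actions are tied, so only the top position is really informative):

pmf = [0.00714286, 0.00714286, 0.00714286, 0.00714286,
       0.00714286, 0.00714286, 0.95714283]
# Actions are 1-based; sort them by probability, highest first (ties keep id order)
ranking = sorted(range(1, len(pmf) + 1), key=lambda a: pmf[a - 1], reverse=True)
print(ranking)  # [7, 1, 2, 3, 4, 5, 6]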
Upvotes: 2