Reputation: 11
I am implementing a contextual bandit with Vowpal Wabbit for dynamic pricing, where the arms represent price margins. The cost/reward is computed as price minus expected cost. The cost is not known up front, so it is a prediction that may change once realized. My question: if the cost/reward can change over time, can you update it to reflect the realized cost and retrain the model?
Below is an example with a training set with one feature (user) and a test set. The cost is based on the expected net revenue. The model is trained and then used to predict which action to take for the customers in the test set.
import pandas as pd
from vowpalwabbit import pyvw

# Historical training data: one contextual feature (user) per example.
train_data = [{'action': 1, 'cost': -150, 'probability': 0.4, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
              {'action': 4, 'cost': -250, 'probability': 0.5, 'user': 'c'},
              {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]
train_df = pd.DataFrame(train_data)

# Add a 1-based index to the data frame
train_df['index'] = range(1, len(train_df) + 1)
train_df = train_df.set_index("index")

# Test data
test_data = [{'user': 'b'},
             {'user': 'a'},
             {'user': 'b'},
             {'user': 'c'}]
test_df = pd.DataFrame(test_data)

# Add a 1-based index to the data frame
test_df['index'] = range(1, len(test_df) + 1)
test_df = test_df.set_index("index")

# Create the VW contextual-bandit model (4 arms) and learn from each training example.
vw = pyvw.vw("--cb 4")

for i in train_df.index:
    action = train_df.loc[i, "action"]
    cost = train_df.loc[i, "cost"]
    probability = train_df.loc[i, "probability"]
    user = train_df.loc[i, "user"]
    # Construct the example in the required VW format: action:cost:probability | features
    learn_example = str(action) + ":" + str(cost) + ":" + str(probability) + " | " + str(user)
    # Here we do the actual learning.
    vw.learn(learn_example)

# Predict an action for each test user; predict() returns the chosen arm.
for j in test_df.index:
    user = test_df.loc[j, "user"]
    test_example = "| " + str(user)
    choice = vw.predict(test_example)
    print(j, choice)
However, after a week we received new information: the realized cost was higher than expected for the first training example (index 0 in the train_data list) and lower than expected for the third (index 2). Can this new information be used to retrain the model and predict actions?
# Reward/cost changed after 1 week once the cost was realized
train_data = [{'action': 1, 'cost': 200, 'probability': 0.4, 'user': 'a'},   # Lost money
              {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
              {'action': 4, 'cost': -350, 'probability': 0.5, 'user': 'c'},  # Made more than expected
              {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]
Upvotes: 1
Views: 356
Reputation: 821
Yes, I don't see why changing the reward over time would be a problem; this is how the real world works too. Actions may become more or less appropriate in a changing world. Contextual bandits are designed to work well in a non-stationary environment, so it should be fine.
One thing to note, though: if your environment is non-stationary, you probably want to set the --power_t option to 0. By default, VW's learning rate decays over time (as a power of t), because if your problem were stationary you would want to converge on a solution.
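As a minimal sketch of what that retraining could look like, assuming the same pyvw API as in your question: create a fresh model (here with --power_t 0 added) and replay the examples using the realized costs. The corrected_data list below is just your updated train_data.

# Minimal sketch: retrain with the realized costs, assuming the same
# pyvw API as in the question. --power_t 0 keeps the learning rate
# constant so later (corrected) examples still carry weight.
from vowpalwabbit import pyvw

corrected_data = [{'action': 1, 'cost': 200, 'probability': 0.4, 'user': 'a'},
                  {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
                  {'action': 4, 'cost': -350, 'probability': 0.5, 'user': 'c'},
                  {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
                  {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]

vw = pyvw.vw("--cb 4 --power_t 0")
for row in corrected_data:
    # Same action:cost:probability | features format as before.
    example = "{action}:{cost}:{probability} | {user}".format(**row)
    vw.learn(example)

print(vw.predict("| a"))

Whether you retrain from scratch or keep feeding corrected examples to the live model is a design choice: since VW learns online, the latter is cheaper, but retraining from scratch avoids keeping the stale wrong-cost updates in the model's history.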
Upvotes: 2