Reputation: 11
I am implementing a contextual bandit with Vowpal Wabbit for dynamic pricing, where the arms represent price margins. The cost/reward is computed as price minus expected cost. The cost is not known up front, so it is a prediction that may change once realized. My question: if the cost/reward can change over time, can you update it to reflect the realized cost and retrain the model?
Below is an example with a training set with one feature (user) and a test set. The cost is based on the expected net revenue. The model is trained and then used to predict which action to take for the customers in the test set.
import pandas as pd
from vowpalwabbit import pyvw

# Historical training data: one contextual feature (user) per example.
train_data = [{'action': 1, 'cost': -150, 'probability': 0.4, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
              {'action': 4, 'cost': -250, 'probability': 0.5, 'user': 'c'},
              {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]
train_df = pd.DataFrame(train_data)

# Add a 1-based index to the data frame
train_df['index'] = range(1, len(train_df) + 1)
train_df = train_df.set_index("index")

# Test data
test_data = [{'user': 'b'},
             {'user': 'a'},
             {'user': 'b'},
             {'user': 'c'}]
test_df = pd.DataFrame(test_data)

# Add a 1-based index to the data frame
test_df['index'] = range(1, len(test_df) + 1)
test_df = test_df.set_index("index")

# Create the VW contextual-bandit model (4 arms) and learn from each training example.
vw = pyvw.vw("--cb 4")

for i in train_df.index:
    action = train_df.loc[i, "action"]
    cost = train_df.loc[i, "cost"]
    probability = train_df.loc[i, "probability"]
    user = train_df.loc[i, "user"]
    # Construct the example in the required VW format: action:cost:probability | features
    learn_example = str(action) + ":" + str(cost) + ":" + str(probability) + " | " + str(user)
    # Here we do the actual learning.
    vw.learn(learn_example)

# Predict an action for each test user; predict() returns the chosen arm.
for j in test_df.index:
    user = test_df.loc[j, "user"]
    test_example = "| " + str(user)
    choice = vw.predict(test_example)
    print(j, choice)
However, after a week we received new information: the realized cost was higher than expected for the first training example (index 0 in the train_data list) and lower than expected for the third (index 2). Can this new information be used to retrain the model and predict actions?
# Reward/cost changed after 1 week once the cost was realized
train_data = [{'action': 1, 'cost': 200, 'probability': 0.4, 'user': 'a'},   # Lost money
              {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
              {'action': 4, 'cost': -350, 'probability': 0.5, 'user': 'c'},  # Made more than expected
              {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
              {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]
Upvotes: 1
Views: 356
Reputation: 821
Yes, I don't see why changing the reward over time would be a problem; this is how the real world works too. Actions may become more or less appropriate in a changing world. Contextual bandits are designed to work well in a non-stationary environment, so it should be fine.
One thing to note, though: if your environment is non-stationary, you probably want to set the --power_t option to 0. By default, VW's learning rate decays over time (as a power of t), because if your problem were stationary you would want to converge on a solution.
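As a minimal sketch of what that retraining could look like, assuming the same pyvw API as in your question: create a fresh model (here with --power_t 0 added) and replay the examples using the realized costs. The corrected_data list below is just your updated train_data.

# Minimal sketch: retrain with the realized costs, assuming the same
# pyvw API as in the question. --power_t 0 keeps the learning rate
# constant so later (corrected) examples still carry weight.
from vowpalwabbit import pyvw

corrected_data = [{'action': 1, 'cost': 200, 'probability': 0.4, 'user': 'a'},
                  {'action': 3, 'cost': 0, 'probability': 0.2, 'user': 'b'},
                  {'action': 4, 'cost': -350, 'probability': 0.5, 'user': 'c'},
                  {'action': 2, 'cost': 0, 'probability': 0.3, 'user': 'a'},
                  {'action': 3, 'cost': 0, 'probability': 0.7, 'user': 'a'}]

vw = pyvw.vw("--cb 4 --power_t 0")
for row in corrected_data:
    # Same action:cost:probability | features format as before.
    example = "{action}:{cost}:{probability} | {user}".format(**row)
    vw.learn(example)

print(vw.predict("| a"))

Whether you retrain from scratch or keep feeding corrected examples to the live model is a design choice: since VW learns online, the latter is cheaper, but retraining from scratch avoids keeping the stale wrong-cost updates in the model's history.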
Upvotes: 2