Roelant

Reputation: 5119

Calculating AUC per group in tensorflow 2.0

We have a simple dataset of users U, items I, and binary outcomes Y. The dataset is large (100K users, 10M items, 1.5B interactions) and chronologically ordered. We are training a model, say a simple matrix factorization (MF) model, that gives us a prediction f(U, I) = Yhat.

When training is finished, we want the area under the curve (AUC) per item, i.e. a mapping from each item i to its AUC. Keeping a dict {i: tf.keras.metrics.AUC} and masking the predictions of each batch gives us memory errors: the combined AUC objects (one for each item) are too big.
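A back-of-the-envelope estimate of that memory, as a sketch only. It assumes each tf.keras.metrics.AUC keeps its default 200 thresholds and four float32 accumulator vectors (true/false positives and true/false negatives); the constant names and the helper `auc_state_bytes` are illustrative, not part of any library.

```python
# Rough estimate only; the numbers below are assumptions stated in the text.
NUM_THRESHOLDS = 200      # assumed default num_thresholds of tf.keras.metrics.AUC
NUM_ACCUMULATORS = 4      # true/false positives and true/false negatives
BYTES_PER_FLOAT32 = 4
NUM_ITEMS = 10_000_000    # item count from the question


def auc_state_bytes(num_items, num_thresholds=NUM_THRESHOLDS):
    """Bytes for the accumulator values alone, ignoring per-object overhead."""
    return num_items * NUM_ACCUMULATORS * num_thresholds * BYTES_PER_FLOAT32


total_gb = auc_state_bytes(NUM_ITEMS) / 1e9
print(f"{total_gb:.0f} GB")  # prints 32 GB
```

The real footprint would be higher still, since every metric object and variable carries its own overhead on top of the raw accumulator values.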

What does work is saving a separate dataset per item, predicting on that, and saving the AUC. However, we would prefer not to create two datasets. Any suggestions on how we could approach this?

Upvotes: 2

Views: 718

Answers (1)

Yaoshiang

Reputation: 1941

I've run into this problem myself. You want to do a groupby on the metric using information in x, but that information is not available in y or y_hat.

Metrics are like losses in that they only get to see y_true and y_hat (y_pred in Keras). Since U and I are not passed to the loss or metric, a custom metric on its own does not have enough information to do this groupby.

A hard way to solve this is to wrap your real model in another model whose output serializes y_hat, U, and I together. Your custom metric can then deserialize y_hat, U, and I, and store the information grouped by item. If the AUC per item is defined as the average of the AUC per interaction, then this is compact enough to keep in memory. If not, you may need to write the information to disk from your custom metric. I'd recommend gdbm, which has an easy interface in Python (the dbm.gnu module).
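As a rough illustration of the on-disk option, here is a minimal sketch that appends (label, score) records per item to a key-value file. It uses the standard-library `dbm.open`, which picks whichever backend is available (gdbm is one of the candidates). The record layout and the helpers `append_batch` and `read_item` are assumptions made for this example, not something the approach prescribes.

```python
import dbm
import os
import struct
import tempfile

# One stored record: label as an unsigned byte, score as a float32 (5 bytes).
RECORD = struct.Struct("<Bf")


def append_batch(db, item_ids, labels, scores):
    """Append one (label, score) record to the bytes stored under each item id."""
    for item_id, label, score in zip(item_ids, labels, scores):
        key = str(int(item_id)).encode()
        previous = db.get(key, b"")
        db[key] = previous + RECORD.pack(int(label), float(score))


def read_item(db, item_id):
    """Return the list of (label, score) pairs stored for one item."""
    key = str(int(item_id)).encode()
    return list(RECORD.iter_unpack(db.get(key, b"")))


# Hypothetical usage with two small batches.
path = os.path.join(tempfile.mkdtemp(), "per_item_scores")
with dbm.open(path, "c") as db:
    append_batch(db, [7, 7, 3], [1, 0, 1], [0.75, 0.25, 0.5])
    append_batch(db, [7], [1], [0.5])
    item_7 = read_item(db, 7)
    missing = read_item(db, 99)
```

Rewriting the whole value on every append is simple but slow for items with many interactions; buffering records per item in memory and flushing them in larger chunks would probably be needed at the scale described in the question.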



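import tensorflow as tf


def serialize(u, i, y):
  # Pack u, i and y_hat into one row per example so the metric can see all three.
  # The ids are cast to y's dtype because tf.concat needs a single dtype.
  return tf.concat(
      [tf.reshape(tf.cast(u, y.dtype), [tf.shape(u)[0], -1]),
       tf.reshape(tf.cast(i, y.dtype), [tf.shape(i)[0], -1]),
       tf.reshape(y, [tf.shape(y)[0], -1])],
      axis=1)


def deserialize(s):
  # s is one serialized row. xyz and abc are placeholders for the offsets where
  # u and i end; the offsets and target shapes depend on the shapes of your inputs.
  u = tf.reshape(s[:xyz], [..., ..., ...])
  i = tf.reshape(s[xyz:abc], [..., ..., ...])
  y = tf.reshape(s[abc:], [..., ..., ...])

  return u, i, y


class AUCPerItem(tf.keras.metrics.Metric):
  def __init__(self, name="auc_per_item", **kwargs):
    super().__init__(name=name, **kwargs)
    self.auc_per_item = {}

  def update_state(self, y_true, y_pred, sample_weight=None):
    # y_pred is the serialized wrapper output, one row per example.
    # Looping over rows and filling a Python dict needs eager execution
    # (e.g. compile the wrapper with run_eagerly=True).
    for serialized_example in y_pred:
      u, i, y = deserialize(serialized_example)
      # do calculations and store in self.auc_per_item

  def result(self):
    # Keras requires a scalar here; read the per-item values from self.auc_per_item.
    return tf.constant(float(len(self.auc_per_item)))


# Model takes U and I as inputs, and outputs y_hat.
model = get_and_train_model()

input_u = tf.keras.Input(...)
input_i = tf.keras.Input(...)
# Pass both inputs as one list; a second positional argument would be read as `training`.
y_hat = model([input_u, input_i])
output = serialize(input_u, input_i, y_hat)

wrapper = tf.keras.Model([input_u, input_i], output)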
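The grouping step that the metric leaves as a comment could look roughly like the sketch below. It is plain Python and assumes the rows have already been deserialized into (item_id, label, score) triples. The helper names `auc` and `auc_per_item` are made up for this example. The AUC here is the rank-based (Mann-Whitney) form, which is exact, whereas tf.keras.metrics.AUC approximates the curve with thresholds.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5.

    Returns None when only one class is present, where AUC is undefined.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    rank_sum_pos = 0.0
    start = 0
    while start < len(order):
        # Find the end of the run of tied scores starting at `start`.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for k in order[start:end + 1]:
            if labels[k] == 1:
                rank_sum_pos += avg_rank
        start = end + 1
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


def auc_per_item(rows):
    """rows: iterable of (item_id, label, score) with label in {0, 1}."""
    grouped = defaultdict(lambda: ([], []))
    for item_id, label, score in rows:
        labels, scores = grouped[item_id]
        labels.append(label)
        scores.append(score)
    return {item: auc(labels, scores) for item, (labels, scores) in grouped.items()}


# Hypothetical rows: item 7 has mixed labels, item 3 has a tie, item 5 has one class.
rows = [(7, 1, 0.9), (7, 0, 0.8), (7, 0, 0.3), (7, 1, 0.5),
        (3, 1, 0.6), (3, 0, 0.6), (5, 1, 0.4)]
result = auc_per_item(rows)
```

This keeps every row in memory, so it only illustrates the grouping logic; at 1.5B interactions the per-item lists would have to be binned, sampled, or kept on disk.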

Upvotes: 4
