viji

Reputation: 477

Does XGBoost's scale_pos_weight correctly balance the positive samples if the training dataset has more positive than negative samples?

After researching, I realized that scale_pos_weight is typically calculated as the ratio of the number of negative samples to the number of positive samples in the training data. My dataset has 840 negative samples and 2650 positive samples, so the ratio is about 0.32. If the counts were the other way around, I believe scale_pos_weight would be a better fit.
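As a quick sanity check of the ratio above (the counts are the ones from my dataset):

```python
# Negative-to-positive ratio from the dataset described above.
n_negative = 840
n_positive = 2650
scale_pos_weight = n_negative / n_positive
print(round(scale_pos_weight, 2))  # 0.32
```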

Is it safe to assume that, even though the ratio is less than 1, it will still balance the classes correctly? Specificity is important in my study, but our primary goal focuses on recall, precision, and F1 score. Could this setting contribute to more false positives by impacting specificity the most?

Upvotes: 1

Views: 568

Answers (1)

gabalz

Reputation: 412

Parameter scale_pos_weight scales only the weights of the positive samples, hence the documentation suggests setting it to sum(negatives)/sum(positives). It leaves the weights of the negative samples at one, so the sum of the scaled weights of the positive samples will equal the sum of the (unit) weights of the negative samples.

The scale_pos_weight parameter has to be positive, but it can also be larger than one, so if you have more negative than positive samples you can still use it and set it to sum(negatives)/sum(positives).
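To see the balancing effect concretely, here is a small sketch (plain Python, not XGBoost code) using the counts from the question; with scale_pos_weight = sum(negatives)/sum(positives), the total weight of the positive class ends up equal to the total weight of the negative class:

```python
# Sketch: scaled positive weights sum to the same total as the unit
# negative weights when scale_pos_weight = n_negative / n_positive.
labels = [1] * 2650 + [0] * 840          # counts from the question
n_pos = sum(1 for y in labels if y == 1)
n_neg = sum(1 for y in labels if y == 0)
spw = n_neg / n_pos                      # 840 / 2650, i.e. ~0.32

weights = [spw if y == 1 else 1.0 for y in labels]
pos_total = sum(w for w, y in zip(weights, labels) if y == 1)
neg_total = sum(w for w, y in zip(weights, labels) if y == 0)
print(abs(pos_total - neg_total) < 1e-6)  # True
```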

The point of this parameter is to let XGBoost see the data as balanced, so using it will, I think, affect metrics that depend on class imbalance, such as precision and the F1 score. The Wikipedia page on the F1 score has some interesting references (12 and 13) on this topic, which you might find useful.


You can see this in the source code. Let's take XGBoost version 2.1.0. First visit src/objective/regression_param.h, where you can find the parameter definition matching what I described above:

struct RegLossParam : public XGBoostParameter<RegLossParam> {
  float scale_pos_weight;
  // declare parameters
  DMLC_DECLARE_PARAMETER(RegLossParam) {
    DMLC_DECLARE_FIELD(scale_pos_weight).set_default(1.0f).set_lower_bound(0.0f)
      .describe("Scale the weight of positive examples by this factor");
  }
};

Parameter scale_pos_weight defaults to 1 and is lower-bounded by zero; it is, however, not upper-bounded, and the description says it scales the "positive examples".

Visiting file src/objective/regression_obj.cu confirms the latter. There you can find

for (size_t idx = begin; idx < end; ++idx) {
  bst_float p = Loss::PredTransform(preds_ptr[idx]);
  bst_float w = _is_null_weight ? 1.0f : weights_ptr[idx / n_targets];
  bst_float label = labels_ptr[idx];
  if (label == 1.0f) {
    w *= _scale_pos_weight;
  }
  out_gpair_ptr[idx] = GradientPair(Loss::FirstOrderGradient(p, label) * w,
                                    Loss::SecondOrderGradient(p, label) * w);
}

which is the only nontrivial use of scale_pos_weight (as far as I can see), and it indeed scales only the positive samples (those with label 1).
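For intuition, the loop above can be mirrored in plain Python for the binary logistic objective, where the transform is the sigmoid, the first-order gradient is p - y, and the second-order gradient is p(1 - p). This is only a sketch of the mechanism, not XGBoost's actual code:

```python
import math

def logistic_gradients(preds, labels, scale_pos_weight):
    """Sketch of the loop above for binary logistic loss."""
    out = []
    for margin, label in zip(preds, labels):
        p = 1.0 / (1.0 + math.exp(-margin))  # PredTransform (sigmoid)
        w = 1.0                              # unit weight (no sample weights)
        if label == 1.0:
            w *= scale_pos_weight            # only positives are rescaled
        grad = (p - label) * w               # FirstOrderGradient
        hess = p * (1.0 - p) * w             # SecondOrderGradient
        out.append((grad, hess))
    return out

# At margin 0 the predicted probability is 0.5 for both samples, but only
# the positive sample's gradient is scaled by 0.32.
pairs = logistic_gradients([0.0, 0.0], [1.0, 0.0], scale_pos_weight=0.32)
print(pairs[0][0], pairs[1][0])  # -0.16 0.5
```

With scale_pos_weight below one (as in the question), each positive sample contributes less to the loss gradient, pushing the model toward predicting the negative class more readily.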

Upvotes: 0
