Ben P

Reputation: 3369

Excluding columns from training data in BigQuery ML

I am training a model using BigQuery ML. My input has several fields, one of which is a customer number. This number is not useful as a prediction feature, but I do need it in the final output so that I can see which users scored high vs. low. How can I exclude this column from model training without removing it completely?

From reading the docs, the only ways I can see to exclude a column are to add it to input_label_cols, which it clearly is not, or to data_split_col, which is not desirable.

Upvotes: 3

Views: 1083

Answers (1)

Mikhail Berlyant

Reputation: 172964

You do not need to include in the model any fields that should not be part of it; simply leave them out of the training query.
Instead, include them in the query you pass in at prediction time.

For example, the model below has only 6 fields as input (carrier, origin, dest, dep_delay, taxi_out, distance):

#standardsql
CREATE OR REPLACE MODEL flights.ontime
OPTIONS
  (model_type='logistic_reg', input_label_cols=['on_time']) AS
SELECT
  IF(arr_delay < 15, 1, 0) AS on_time,
  carrier,
  origin,
  dest,
  dep_delay,
  taxi_out,
  distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL   
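
Applied to the customer-number case from the question, the same idea might look like the sketch below. All names here (`mydataset.customer_model`, `mydataset.customers`, `customer_number`, `churned`) are hypothetical placeholders, not taken from the question. `SELECT * EXCEPT (...)` is standard BigQuery syntax and is just one convenient way to drop the identifier when the table has many columns; listing the feature columns explicitly works equally well.

```sql
#standardsql
-- Hypothetical names: mydataset.customer_model, mydataset.customers,
-- customer_number (identifier to leave out), churned (label).
CREATE OR REPLACE MODEL mydataset.customer_model
OPTIONS
  (model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
  * EXCEPT (customer_number)  -- train on every column except the identifier
FROM mydataset.customers
WHERE churned IS NOT NULL
```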

At prediction time, you can have all the extra fields available, as below. You can put them in any position in the SELECT, but note that the predicted columns will come first in the output:

#standardsql
SELECT * FROM ml.PREDICT(MODEL `cloud-training-demos.flights.ontime`, (
  SELECT
    UNIQUE_CARRIER,         -- extra column
    ORIGIN_AIRPORT_ID,      -- extra column
    IF(arr_delay < 15, 1, 0) AS on_time,
    carrier,
    origin,
    dest,
    dep_delay,
    taxi_out,
    distance
  FROM `cloud-training-demos.flights.tzcorr`
  WHERE arr_delay IS NOT NULL
  LIMIT 5
))   
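
For the customer-number case, a minimal sketch of the prediction side could look like this. It assumes a hypothetical logistic regression model `mydataset.customer_model` that was trained with the label `churned` and without `customer_number`, and a hypothetical table `mydataset.customers`. For this model type, ML.PREDICT names its output columns `predicted_<label>` and `predicted_<label>_probs`, and passes the input columns through alongside them.

```sql
#standardsql
-- Hypothetical names: mydataset.customer_model, mydataset.customers,
-- customer_number, churned.
SELECT
  customer_number,          -- not a model feature, passed through unchanged
  predicted_churned,        -- predicted label
  predicted_churned_probs   -- per-class probabilities
FROM ML.PREDICT(MODEL mydataset.customer_model, (
  SELECT *
  FROM mydataset.customers
))
```

Selecting only the identifier and the predicted columns keeps the result small enough to join back to other customer tables.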

input_label_cols and data_split_col serve different purposes. From the documentation:

input_label_cols STRING The label column name(s) in the training data.

data_split_col STRING This option identifies the column used to split the data [into training and evaluation sets]. This column cannot be used as a feature or label, and will be excluded from features automatically.

Upvotes: 4
