Reputation: 3369
I am training a model using BigQuery ML. My input has several fields, one of which is a customer number. This number is not useful as a prediction feature, but I do need it in the final output so that I can reference which users scored high vs. low. How can I exclude this column from model training without removing it completely?
Reading the docs, the only ways I can see to exclude a column are to add it to input_label_cols, which it clearly is not, or to data_split_col, which is not desirable.
Upvotes: 3
Views: 1083
Reputation: 172964
You do not need to include in the model any fields that are not meant to be part of it, not at all.
Rather, you include them during prediction.
For example, the model below has only 6 fields as input (carrier, origin, dest, dep_delay, taxi_out, distance):
#standardsql
CREATE OR REPLACE MODEL flights.ontime
OPTIONS
(model_type='logistic_reg', input_label_cols=['on_time']) AS
SELECT
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL
In prediction, however, you can have any extra fields available, as below. You can put them at any position in the SELECT, but note that the predicted columns will come first in the output:
#standardsql
SELECT * FROM ml.PREDICT(MODEL `cloud-training-demos.flights.ontime`, (
SELECT
UNIQUE_CARRIER, -- extra column
ORIGIN_AIRPORT_ID, -- extra column
IF(arr_delay < 15, 1, 0) AS on_time,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL
LIMIT 5
))
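Applied to the customer number case from the question, the same idea can be sketched as follows. This is only an illustration under stated assumptions: the dataset, table, model, and column names (mydataset.customers, mydataset.churn_model, customer_number, churned) are hypothetical placeholders, and SELECT * EXCEPT is used simply as a shorthand for listing every column but the identifier.

```sql
#standardsql
-- Hypothetical names throughout: mydataset.customers, mydataset.churn_model,
-- customer_number, churned. Replace them with your own.

-- Training: feed the model every column except the identifier.
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS
  (model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
  * EXCEPT (customer_number)
FROM `mydataset.customers`;

-- Prediction: keep the identifier in the input query. ML.PREDICT passes
-- input columns through to the output alongside the predicted columns.
SELECT
  customer_number,
  predicted_churned,
  predicted_churned_probs
FROM ML.PREDICT(MODEL mydataset.churn_model, (
  SELECT * FROM `mydataset.customers`
));
```

The outer SELECT picks the identifier and the prediction columns explicitly, so each score can be matched back to its customer.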
Obviously input_label_cols and data_split_col are for different purposes:
input_label_cols (STRING): The label column name(s) in the training data.
data_split_col (STRING): This option identifies the column used to split the data [into training and evaluation sets]. This column cannot be used as a feature or label, and will be excluded from features automatically.
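For contrast, here is a rough sketch of what data_split_col is actually meant for. It assumes data_split_method='CUSTOM', which, as far as I understand the documentation, expects a BOOL column where TRUE rows are used for evaluation and FALSE rows for training. All names (mydataset.customers, mydataset.churn_model_custom_split, customer_number, churned, is_eval) are hypothetical placeholders.

```sql
#standardsql
-- Hypothetical names throughout; replace them with your own.
CREATE OR REPLACE MODEL mydataset.churn_model_custom_split
OPTIONS
  (model_type='logistic_reg',
   input_label_cols=['churned'],
   data_split_method='CUSTOM',
   data_split_col='is_eval') AS
SELECT
  * EXCEPT (customer_number),
  -- Deterministically mark roughly 20% of customers as evaluation rows.
  MOD(ABS(FARM_FINGERPRINT(CAST(customer_number AS STRING))), 10) < 2 AS is_eval
FROM `mydataset.customers`;
```

Here is_eval only controls the train/evaluation split and is dropped from the features, which is why it is not a good place to park an identifier you merely want to carry through.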
Upvotes: 4