Lawrence Coles

Reputation: 133

TensorFlow - `keys` or `default_value` doesn't match the table data types

(Complete novice at python, machine learning, and TensorFlow)

I am attempting to adapt the TensorFlow Linear Model Tutorial from the official documentation to the Abalone dataset hosted on the UCI Machine Learning Repository. The intent is to predict the rings (age) of an abalone from the other given data.

When running the below program I get the following:

File "/home/lawrence/tensorflow3.5/lib/python3.5/site-packages/tensorflow/python/ops/lookup_ops.py", line 220, in lookup
    (self._key_dtype, keys.dtype))
TypeError: Signature mismatch. Keys must be dtype <dtype: 'string'>, got <dtype: 'int32'>.

The error is being thrown in lookup_ops.py at line 220 and is documented as being thrown when:

    Raises:
      TypeError: when `keys` or `default_value` doesn't match the table data types.

From debugging parse_csv(), all the tensors appear to be created with the correct types.

Could you please explain what is going wrong? I believe I am following the tutorial code logic and cannot figure this out.

Source Code:

import tensorflow as tf
import shutil

_CSV_COLUMNS = [
    'sex', 'length', 'diameter', 'height', 'whole_weight',
    'shucked_weight', 'viscera_weight', 'shell_weight', 'rings'
]

_CSV_COLUMN_DEFAULTS = [['M'], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [0]]

_NUM_EXAMPLES = {
    'train': 3000,
    'validation': 1177,
}

def build_model_columns():
  """Builds a set of wide feature columns."""
  # Categorical column
  sex = tf.feature_column.categorical_column_with_hash_bucket('sex', hash_bucket_size=1000)
  # Continuous columns
  length = tf.feature_column.numeric_column('length', dtype=tf.float32)
  diameter = tf.feature_column.numeric_column('diameter', dtype=tf.float32)
  height = tf.feature_column.numeric_column('height', dtype=tf.float32)
  whole_weight = tf.feature_column.numeric_column('whole_weight', dtype=tf.float32)
  shucked_weight = tf.feature_column.numeric_column('shucked_weight', dtype=tf.float32)
  viscera_weight = tf.feature_column.numeric_column('viscera_weight', dtype=tf.float32)
  shell_weight = tf.feature_column.numeric_column('shell_weight', dtype=tf.float32)

  base_columns = [sex, length, diameter, height, whole_weight,
                  shucked_weight, viscera_weight, shell_weight]

  return base_columns

def build_estimator():
  """Build an estimator appropriate for the given model type."""
  base_columns = build_model_columns()

  return tf.estimator.LinearClassifier(
      model_dir="~/models/albones/",
      feature_columns=base_columns,
      label_vocabulary=_CSV_COLUMNS)


def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have either run data_download.py or '
      'set both arguments --train_data and --test_data.' % data_file)

  def parse_csv(value):
      print('Parsing', data_file)
      columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
      features = dict(zip(_CSV_COLUMNS, columns))
      labels = features.pop('rings')

      return features, labels

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size)

  iterator = dataset.make_one_shot_iterator()
  features, labels = iterator.get_next()

  return features, labels

def main(unused_argv):
  # Clean up the model directory if present
  shutil.rmtree("/home/lawrence/models/albones/", ignore_errors=True)
  model = build_estimator()

  # Train and evaluate the model every `FLAGS.epochs_per_eval` epochs.
  for n in range(40 // 2):
    model.train(input_fn=lambda: input_fn(
        "/home/lawrence/abalone.data", 2, True, 40))

    results = model.evaluate(input_fn=lambda: input_fn(
        "/home/lawrence/abalone.data", 1, False, 40))

    # Display evaluation metrics
    print('Results at epoch', (n + 1) * 2)
    print('-' * 60)

    for key in sorted(results):
      print('%s: %s' % (key, results[key]))


if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.INFO)
    tf.app.run(main=main)

Here is the classification of the columns of the dataset from abalone.names:

Name            Data Type   Meas.   Description
----            ---------   -----   -----------
Sex             nominal             M, F, [or] I (infant)
Length          continuous  mm      Longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer             +1.5 gives the age in years

Dataset entries appear in this order as comma-separated values, with each entry on a new line.

Upvotes: 2

Views: 722

Answers (1)

Maxim

Reputation: 53758

You've done almost everything right. The problem is with the definition of an estimator.

The task is to predict the Rings column, which is an integer, so it looks like a regression problem. But you've decided to do a classification task, which is also valid:

def build_estimator():
  """Build an estimator appropriate for the given model type."""
  base_columns = build_model_columns()

  return tf.estimator.LinearClassifier(
      model_dir="~/models/albones/",
      feature_columns=base_columns,
      label_vocabulary=_CSV_COLUMNS)

By default, `tf.estimator.LinearClassifier` assumes binary classification, i.e., `n_classes=2`. In your case that's obviously not true - that's the first bug. You've also set `label_vocabulary`, which TensorFlow interprets as the set of possible string values in the label column. That's why it expects `tf.string` dtype. Since rings is an integer label, you simply don't need `label_vocabulary` at all.
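For contrast, here is a hypothetical sketch of when `label_vocabulary` *is* appropriate: it maps string labels to class indices, so it only makes sense when the label column holds strings - say, if you were predicting the `sex` column instead of `rings` (an illustration only, not part of the original task):

```python
import tensorflow as tf

# Hypothetical task: predict the string-valued 'sex' column ('M', 'F', 'I').
# label_vocabulary tells the estimator how to map those strings to class ids,
# which is exactly why it then expects labels of dtype tf.string.
length = tf.feature_column.numeric_column('length', dtype=tf.float32)

sex_classifier = tf.estimator.LinearClassifier(
    feature_columns=[length],
    n_classes=3,
    label_vocabulary=['M', 'F', 'I'])
```

With an integer label column like rings, no such mapping exists, so the estimator falls back to expecting string keys and the lookup fails with the dtype mismatch you saw.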

Combining it all together:

def build_estimator():
  """Build an estimator appropriate for the given model type."""
  base_columns = build_model_columns()

  return tf.estimator.LinearClassifier(
    model_dir="~/models/albones/",
    feature_columns=base_columns,
    n_classes=30)

I suggest you also try tf.estimator.LinearRegressor, which will probably be more accurate.
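A minimal sketch of that regression variant, reusing the question's `build_model_columns()` output (same estimator-style API as the rest of the code; only the estimator class changes):

```python
import tensorflow as tf

def build_regressor(base_columns):
  """Regression variant: predict rings directly as a continuous value.

  Since rings + 1.5 gives the age in years, treating this as regression
  sidesteps choosing n_classes and preserves the ordering of the labels,
  which classification throws away.
  """
  return tf.estimator.LinearRegressor(
      model_dir="~/models/albones/",
      feature_columns=base_columns)
```

Everything else - the input pipeline, the train/evaluate loop - can stay exactly as it is; only the reported metrics change (e.g. average_loss instead of accuracy).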

Upvotes: 1
