ellonde

Reputation: 89

Hyperparameter tuning with ml-engine returns State: failed

I'm trying to get my model's hyperparameters tuned with ml-engine, but I'm not quite sure whether it's working or not.

I'm not specifying the algorithm field in HyperparameterSpec, which according to the documentation should default to the Bayesian optimization method. I'm also not setting maxFailedTrials, which, according to the documentation, should end all trials if the first one fails.
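
For reference, the two fields I'm leaving unset would look roughly like this if I set them explicitly (values are just the defaults as I understand them from the HyperparameterSpec reference):

hyperparameters:
  # leaving algorithm out (or ALGORITHM_UNSPECIFIED) means Bayesian optimization
  algorithm: ALGORITHM_UNSPECIFIED
  # leaving maxFailedTrials out (or 0) uses the service's default failure handling
  maxFailedTrials: 0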

Here is my config

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 8
    maxParallelTrials: 2
    hyperparameterMetricTag: test_accuracy
    params:
    - parameterName: dropout_rate
      type: DOUBLE
      minValue: 0.3
      maxValue: 0.7
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: lr
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.0003
      scaleType: UNIT_LINEAR_SCALE

And here is the training output:

{
  "completedTrialCount": "8",
  "trials": [
    {
      "trialId": "1",
      "hyperparameters": {
        "lr": "0.00014959385395050048",
        "dropout_rate": "0.42217149734497067"
      },
      "startTime": "2019-10-07T09:40:02.143968039Z",
      "endTime": "2019-10-07T09:47:50Z",
      "state": "FAILED"
    },
    {
      "trialId": "2",
      "hyperparameters": {
        "dropout_rate": "0.62217149734497068",
        "lr": "0.00028292718728383382"
      },
      "startTime": "2019-10-07T09:40:02.144192681Z",
      "endTime": "2019-10-07T09:47:19Z",
      "state": "FAILED"
    },
    {
      "trialId": "3",
      "hyperparameters": {
        "lr": "0.00014846909046173097",
        "dropout_rate": "0.31717863082885739"
      },
      "startTime": "2019-10-07T09:48:09.266596472Z",
      "endTime": "2019-10-07T09:55:26Z",
      "state": "FAILED"
    },
    {
      "trialId": "4",
      "hyperparameters": {
        "lr": "0.00018741662502288819",
        "dropout_rate": "0.34178204536437984"
      },
      "startTime": "2019-10-07T09:48:10.761305330Z",
      "endTime": "2019-10-07T09:55:58Z",
      "state": "FAILED"
    },
    {
      "trialId": "5",
      "hyperparameters": {
        "dropout_rate": "0.6216828346252441",
        "lr": "0.00010192830562591553"
      },
      "startTime": "2019-10-07T09:56:15.904704865Z",
      "endTime": "2019-10-07T10:04:04Z",
      "state": "FAILED"
    },
    {
      "trialId": "6",
      "hyperparameters": {
        "dropout_rate": "0.42288427352905272",
        "lr": "0.000230206298828125"
      },
      "startTime": "2019-10-07T09:56:17.895067636Z",
      "endTime": "2019-10-07T10:04:05Z",
      "state": "FAILED"
    },
    {
      "trialId": "7",
      "hyperparameters": {
        "lr": "0.00019101441543291624",
        "dropout_rate": "0.36415641310447144"
      },
      "startTime": "2019-10-07T10:05:22.147233194Z",
      "endTime": "2019-10-07T10:13:09Z",
      "state": "FAILED"
    },
    {
      "trialId": "8",
      "hyperparameters": {
        "dropout_rate": "0.69955616224911532",
        "lr": "0.00029989311482522672"
      },
      "startTime": "2019-10-07T10:05:22.147396438Z",
      "endTime": "2019-10-07T10:13:30Z",
      "state": "FAILED"
    }
  ],
  "consumedMLUnits": 2.29,
  "isHyperparameterTuningJob": true,
  "hyperparameterMetricTag": "test_accuracy"
}

All trials are run, so I believe it's the search algorithm that fails for some reason. I haven't been able to find any more information on why it returns this, or any logs from the search algorithm, even when running with a different verbosity.
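
For context, the trial output above and the only logs I've found come from commands roughly like these (job id omitted; on older SDK versions the command group is gcloud ml-engine instead of gcloud ai-platform):

# Show the tuning job state and per-trial results
gcloud ai-platform jobs describe <job_id>

# Stream the master/worker logs for the job
gcloud ai-platform jobs stream-logs <job_id>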

To me it seems like it's not able to locate the metric in the TensorFlow event files, but I don't understand why, since the name is exactly the same, and when I open the event files with TensorBoard I can see the data. Maybe there are some requirements for the log structure that I'm not aware of?

The code for logging metrics:

from tensorflow.contrib.summary import summary as summary_ops

# in __init__
self.tf_board_writer = summary_ops.create_file_writer(self.save_path)
....

# During training
with self.tf_board_writer.as_default(), summary_ops.always_record_summaries():
    summary_ops.scalar(name=name, tensor=value, step=step)

Small side question, if anyone from the ml-engine team ends up in here: now that TF2 is stable and released, do you have any idea when it will be available in the runtime environment?

Anyways, hope someone can help me out :)

Upvotes: 0

Views: 223

Answers (1)

ellonde

Reputation: 89

The problem could be solved by using the Python package cloudml-hypertune with the following code:

self.hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag=hyperparam_metric_name,
    metric_value=value,
    global_step=step)

And then set hyperparameterMetricTag in the HyperparameterSpec to the same string as hyperparam_metric_name.
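
For completeness, a minimal end-to-end sketch of how this fits together, assuming the metric is called test_accuracy as in my config (variable values are just illustrative):

# The package is installed with `pip install cloudml-hypertune`
# but imported as `hypertune`.
import hypertune

hpt = hypertune.HyperTune()

# Call after each evaluation; the tag must match hyperparameterMetricTag.
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='test_accuracy',
    metric_value=float(test_accuracy),
    global_step=int(global_step))

And the matching line in the config:

trainingInput:
  hyperparameters:
    hyperparameterMetricTag: test_accuracy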

Upvotes: 1
