Naveen Reddy Marthala

Reputation: 3123

How to fix SageMaker data-quality monitoring-schedule job that fails with 'FailureReason': 'Job inputs had no data'

I am trying to schedule a data-quality monitoring job in AWS SageMaker by following the steps in this AWS documentation page. I have enabled data capture for my endpoint, then trained a baseline on my training CSV file, and the resulting statistics and constraints are available in S3 like this:

from sagemaker import get_execution_role
from sagemaker import image_uris
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_data_monitor = DefaultModelMonitor(
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    volume_size_in_gb=30,
    max_runtime_in_seconds=3_600)

# base s3 directory
baseline_dir_uri = 's3://api-trial/data_quality_no_headers/'
# train data, that I have used to generate baseline
baseline_data_uri = baseline_dir_uri + 'ch_train_no_target.csv'
# directory in s3 bucket that I have stored my baseline results to 
baseline_results_uri = baseline_dir_uri + 'baseline_results_try17/'


my_data_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True, logs=False, job_name='ch-dq-baseline-try21'
)

and the data is available in S3.

Then I tried to schedule my data-quality monitoring job by following this example notebook for model-quality monitoring in the sagemaker-examples GitHub repo, making the necessary modifications based on feedback from the error messages.

Here's how I tried to schedule the data-quality monitoring job from SageMaker Studio:

from sagemaker import get_execution_role
from sagemaker.model_monitor import EndpointInput
from sagemaker import image_uris
from sagemaker.model_monitor import CronExpressionGenerator
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# base s3 directory
baseline_dir_uri = 's3://api-trial/data_quality_no_headers/'

# train data, that I have used to generate baseline
baseline_data_uri = baseline_dir_uri + 'ch_train_no_target.csv'

# directory in s3 bucket that I have stored my baseline results to 
baseline_results_uri = baseline_dir_uri + 'baseline_results_try17/'
# s3 locations of baseline job outputs
baseline_statistics = baseline_results_uri + 'statistics.json'
baseline_constraints = baseline_results_uri + 'constraints.json'

# directory in s3 bucket that I would like to store results of monitoring schedules in
monitoring_outputs = baseline_dir_uri + 'monitoring_results_try17/'

# myendpoint_name holds the name of my deployed endpoint ('ch-dq-nh-try21')
ch_dq_ep = EndpointInput(endpoint_name=myendpoint_name,
                         destination="/opt/ml/processing/input_data",
                         s3_input_mode="File",
                         s3_data_distribution_type="FullyReplicated")

monitor_schedule_name='ch-dq-monitor-schdl-try21'

my_data_monitor.create_monitoring_schedule(endpoint_input=ch_dq_ep,
                                           monitor_schedule_name=monitor_schedule_name,
                                           output_s3_uri=baseline_dir_uri,
                                           constraints=baseline_constraints,
                                           statistics=baseline_statistics,
                                           schedule_cron_expression=CronExpressionGenerator.hourly(),
                                           enable_cloudwatch_metrics=True)

After an hour or so, when I check the status of the schedule like this:

import boto3
boto3_sm_client = boto3.client('sagemaker')
boto3_sm_client.describe_monitoring_schedule(MonitoringScheduleName='ch-dq-monitor-schdl-try21')

I get a failed status like below:

'MonitoringExecutionStatus': 'Failed',
  ...
  'FailureReason': 'Job inputs had no data'},

Entire Message:

```
{'MonitoringScheduleArn': 'arn:aws:sagemaker:ap-south-1:<my-account-id>:monitoring-schedule/ch-dq-monitor-schdl-try21',
 'MonitoringScheduleName': 'ch-dq-monitor-schdl-try21',
 'MonitoringScheduleStatus': 'Scheduled',
 'MonitoringType': 'DataQuality',
 'CreationTime': datetime.datetime(2021, 9, 14, 13, 7, 31, 899000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2021, 9, 14, 14, 1, 13, 247000, tzinfo=tzlocal()),
 'MonitoringScheduleConfig': {'ScheduleConfig': {'ScheduleExpression': 'cron(0 * ? * * *)'},
  'MonitoringJobDefinitionName': 'data-quality-job-definition-2021-09-14-13-07-31-483',
  'MonitoringType': 'DataQuality'},
 'EndpointName': 'ch-dq-nh-try21',
 'LastMonitoringExecutionSummary': {'MonitoringScheduleName': 'ch-dq-monitor-schdl-try21',
  'ScheduledTime': datetime.datetime(2021, 9, 14, 14, 0, tzinfo=tzlocal()),
  'CreationTime': datetime.datetime(2021, 9, 14, 14, 1, 9, 405000, tzinfo=tzlocal()),
  'LastModifiedTime': datetime.datetime(2021, 9, 14, 14, 1, 13, 236000, tzinfo=tzlocal()),
  'MonitoringExecutionStatus': 'Failed',
  'EndpointName': 'ch-dq-nh-try21',
  'FailureReason': 'Job inputs had no data'},
 'ResponseMetadata': {'RequestId': 'dd729244-fde9-44b5-9904-066eea3a49bb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dd729244-fde9-44b5-9904-066eea3a49bb',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '835',
   'date': 'Tue, 14 Sep 2021 14:27:53 GMT'},
  'RetryAttempts': 0}}
```

Possible things that might have gone wrong on my side, or that might help fix my issue:

  1. dataset used for baseline: I have tried to create a baseline with the dataset both with and without my target variable (the dependent variable, y), and the error persisted both times. So I think the error originates from a different cause.
  2. log groups: there are no log groups created for these monitoring jobs for me to look at to debug the issue. The baseline jobs do have log groups, so I presume the role used by the monitoring-schedule jobs is not lacking permissions to create a log group or stream.
  3. role: the attached role comes from get_execution_role(), which points to a role with full access to SageMaker, CloudWatch, S3, and some other services.
  4. the data collected from my endpoint during inference: here's what a line of the .jsonl file saved to S3, which contains the data captured during inference, looks like:
{"captureData":{"endpointInput":{"observedContentType":"application/json","mode":"INPUT","data":"{\"longitude\": [-122.32, -117.58], \"latitude\": [37.55, 33.6], \"housing_median_age\": [50.0, 5.0], \"total_rooms\": [2501.0, 5348.0], \"total_bedrooms\": [433.0, 659.0], \"population\": [1050.0, 1862.0], \"households\": [410.0, 555.0], \"median_income\": [4.6406, 11.0567]}","encoding":"JSON"},"endpointOutput":{"observedContentType":"text/html; charset=utf-8","mode":"OUTPUT","data":"eyJtZWRpYW5faG91c2VfdmFsdWUiOiBbNDUyOTU3LjY5LCA0NjcyMTQuNF19","encoding":"BASE64"}},"eventMetadata":{"eventId":"9804d438-eb4c-4cb4-8f1b-d0c832b641aa","inferenceId":"ef07163d-ea2d-4730-92f3-d755bc04ae0d","inferenceTime":"2021-09-14T13:59:03Z"},"eventVersion":"0"}
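Incidentally, the endpointOutput in such a record is BASE64-encoded. A minimal sketch (standard library only, with the record above trimmed to the relevant fields) of recovering the prediction:

```python
import base64
import json

# one captured record, trimmed to the endpointOutput fields from the line above
record = {
    "captureData": {
        "endpointOutput": {
            "data": "eyJtZWRpYW5faG91c2VfdmFsdWUiOiBbNDUyOTU3LjY5LCA0NjcyMTQuNF19",
            "encoding": "BASE64",
        }
    }
}

out = record["captureData"]["endpointOutput"]
# decode BASE64 payloads back into the JSON the endpoint returned
payload = base64.b64decode(out["data"]).decode("utf-8") if out["encoding"] == "BASE64" else out["data"]
print(json.loads(payload))  # {'median_house_value': [452957.69, 467214.4]}
```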

I would like to know what has gone wrong in this entire process, that led to data not being fed to my monitoring job.

Upvotes: 4

Views: 1732

Answers (2)

theansaricode

Reputation: 132

You are facing this issue because the job cannot find any captured data. If the model monitor is scheduled with hourly() and no one used the deployed model for predictions in the previous hour, then there is no captured data in the S3 bucket for that hour. You will have to use your deployed model for predictions; that will save the captured data into the S3 bucket. Then wait for the model monitor job to run in the next hour.

Use the code snippet below to run predictions on the validation.csv dataset.

import boto3
from time import sleep

# endpoint_name must be set to the name of your deployed endpoint
validate_dataset = "validation_with_predictions.csv"
sm_client = boto3.client("sagemaker-runtime")
# Cut-off threshold of 80%, only needed if you convert probabilities to labels below
cutoff = 0.8

limit = 200  # need at least 200 samples to compute standard deviations
i = 0
with open(f"test_data/{validate_dataset}", "w") as validation_file:
    validation_file.write("prediction,label\n")  # CSV header
    with open("test_data/validation.csv", "r") as f:
        for row in f:
            (label, input_cols) = row.split(",", 1)
            res = sm_client.invoke_endpoint(EndpointName=endpoint_name,
                                            ContentType="text/csv",
                                            Body=input_cols)
            prediction = res["Body"].read().decode()
            # prediction = "1" if probability > cutoff else "0"
            validation_file.write(f"{prediction},{label}\n")
            i += 1
            if i > limit:
                break
            print(".", end="", flush=True)
            sleep(0.5)
print()
print("Done!")


Upvotes: 1

Naveen Reddy Marthala

Reputation: 3123

This happens during the ground-truth-merge job when Spark can't find any data in either the '/opt/ml/processing/groundtruth/' or '/opt/ml/processing/input_data/' directory. That can happen when either you haven't sent any requests to the SageMaker endpoint or there are no ground truths.

I got this error because the folder /opt/ml/processing/input_data/ of the Docker volume mapped to the monitoring container had no data to process. That happened because the process that fetches the captured data couldn't find any in S3, and that, in turn, was because there was an extra slash (/) in the directory the endpoint's captured data is saved to. To elaborate: while creating the endpoint, I had specified the directory as s3://<bucket-name>/<folder-1>/, while it should have been just s3://<bucket-name>/<folder-1>. So when the job tried to fetch that hour's data, the directory it tried to extract it from was s3://<bucket-name>/<folder-1>//<endpoint-name>/<variant-name>/<year>/<month>/<date>/<hour> (notice the two slashes). Once I recreated the endpoint configuration with the trailing slash removed from the S3 directory, the error was gone and the ground-truth-merge operation succeeded as part of model-quality monitoring.
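The double slash is just plain string concatenation at work; a minimal sketch (the endpoint and variant names are made up) of how the trailing slash corrupts the hourly prefix:

```python
# capture destination as I had configured it (trailing slash) vs. corrected
bad_prefix = "s3://<bucket-name>/<folder-1>/"
good_prefix = "s3://<bucket-name>/<folder-1>"

# SageMaker appends "/<endpoint-name>/<variant-name>/<year>/<month>/<date>/<hour>"
suffix = "/my-endpoint/AllTraffic/2021/09/14/14"

print(bad_prefix + suffix)   # ...<folder-1>//my-endpoint/... : double slash, so no data is found
print(good_prefix + suffix)  # ...<folder-1>/my-endpoint/...  : the prefix the hourly job expects
```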

I am answering my own question because someone read it and upvoted it, meaning someone else has faced this problem too, so I have written down what worked for me. And I wrote this so that StackExchange doesn't think I am spamming the forum with questions.

Upvotes: 5
