Natalie

Reputation: 447

Amazon SageMaker Ground Truth: Cannot get active learning to work

I am trying to test SageMaker Ground Truth's active learning capability, but I can't figure out how to get the auto-labeling part to work. I started a previous labeling job with an initial model that I had to create manually, which allowed me to retrieve the model's ARN as a starting point for the next job. I uploaded 1,758 dataset objects and labeled 40 of them. I assumed the auto-labeling would take it from here, but the job in SageMaker just says "complete" and only displays the labels that I created. How do I make the auto-labeler work?

Do I have to manually label 1,000 dataset objects before it can start working? I saw this post: Information regarding Amazon Sagemaker groundtruth, where an AWS representative said that some of the 1,000 objects can be auto-labeled. But how is that possible if 1,000 labeled objects are needed before auto-labeling starts?

Thanks in advance.

Upvotes: 1

Views: 1132

Answers (1)

JonathanB-AWS

Reputation: 56

I'm an engineer at AWS. In order to understand the "active learning"/"automated data labeling" feature, it will be helpful to start with a broader recap of how SageMaker Ground Truth works.

First, let's consider the workflow without the active learning feature. Recall that Ground Truth annotates data in batches [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-batching.html]. This means that your dataset is submitted for annotation in "chunks." The size of these batches is controlled by the API parameter MaxConcurrentTaskCount [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#sagemaker-Type-HumanTaskConfig-MaxConcurrentTaskCount], which defaults to 1,000. This value cannot be changed from the AWS console, so the default is used unless you submit your job through the API instead.
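If you do want a non-default batch size, here is a rough sketch of how the parameter might be set when submitting the job with boto3. Every ARN, S3 path, and name below is a placeholder for your own resources, and the region-specific Lambda and algorithm-specification ARNs are listed in the documentation; including LabelingJobAlgorithmsConfig is what turns on automated data labeling.

    import boto3

    sm = boto3.client("sagemaker")

    # Sketch only: every ARN, S3 path, and name below is a placeholder.
    sm.create_labeling_job(
        LabelingJobName="my-active-learning-job",
        LabelAttributeName="my-label",
        RoleArn="arn:aws:iam::<account-id>:role/<ground-truth-execution-role>",
        InputConfig={
            "DataSource": {
                "S3DataSource": {"ManifestS3Uri": "s3://<bucket>/input.manifest"}
            }
        },
        OutputConfig={"S3OutputPath": "s3://<bucket>/output/"},
        # Including this block enables automated data labeling (active learning).
        LabelingJobAlgorithmsConfig={
            "LabelingJobAlgorithmSpecificationArn": "<algorithm-specification-arn-for-your-task-type>"
        },
        HumanTaskConfig={
            "WorkteamArn": "<workteam-arn>",
            "UiConfig": {"UiTemplateS3Uri": "s3://<bucket>/template.liquid"},
            "PreHumanTaskLambdaArn": "<region-specific-pre-labeling-lambda-arn>",
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": "<region-specific-consolidation-lambda-arn>"
            },
            "TaskTitle": "Classify this object",
            "TaskDescription": "Choose the most appropriate label",
            "NumberOfHumanWorkersPerDataObject": 1,
            "TaskTimeLimitInSeconds": 300,
            # The batch size discussed above; defaults to 1,000 when omitted,
            # and cannot be changed from the console.
            "MaxConcurrentTaskCount": 250,
        },
    )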

Now, let's consider how active learning fits into this workflow. Active learning runs in between your batches of manual annotation. Another important detail is that Ground Truth partitions your dataset into a validation set and an unlabeled set. For datasets smaller than 5,000 objects, the validation set is 20% of your total dataset; for datasets larger than 5,000 objects, it is 10%. Once the validation set is collected, any data that is subsequently annotated manually constitutes the training set. Both the validation set and the training set are collected according to the batch-wise process described in the previous paragraph. A longer discussion of active learning is available at [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html].

That last paragraph was a bit of a mouthful, so I'll walk through two examples using the numbers you gave.

Example #1

  • Default MaxConcurrentTaskCount ("batch size") of 1,000
  • Total dataset size: 1,758 objects
  • Computed validation set size: 0.2 * 1,758 ≈ 351 objects

Batch schedule:

  1. Annotate 351 objects to populate the validation set (1,407 remaining).
  2. Annotate 1,000 objects to populate the first iteration of the training set (407 remaining).
  3. Run active learning. Depending on the accuracy of the model at this stage, this step may automatically label zero, some, or all of the remaining 407 objects.
  4. (Assuming no objects were automatically labeled in step #3) Annotate the remaining 407 objects manually. The labeling job then ends.

Example #2

  • Non-default MaxConcurrentTaskCount ("batch size") of 250
  • Total dataset size: 1,758 objects
  • Computed validation set size: 0.2 * 1,758 ≈ 351 objects

Batch schedule:

  1. Annotate 250 objects to begin populating the validation set (1,508 remaining).
  2. Annotate 101 objects to finish populating the validation set (1,407 remaining).
  3. Annotate 250 objects to populate the first iteration of the training set (1,157 remaining).
  4. Run active learning. Depending on the accuracy of the model at this stage, this step may automatically label zero, some, or all of the remaining 1,157 objects. All else being equal, we would expect the model to be less accurate than the model in Example #1 at this stage, because the training set here is only 250 objects.
  5. Repeat alternating rounds of annotating batches of 250 objects and running active learning. (A rough sketch of this batch arithmetic follows below.)
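To make the arithmetic in these examples concrete, here is a small sketch that reproduces the batch schedules above. It only mimics the bookkeeping described in this answer (the 20%/10% validation split and fixed-size manual batches, pessimistically assuming active learning labels nothing); it is not how Ground Truth itself decides which objects to auto-label.

    def batch_schedule(total_objects, batch_size=1000):
        """Rough bookkeeping of the batch schedule described in this answer."""
        # Validation split: 20% below 5,000 objects, 10% at or above 5,000.
        validation = int(total_objects * (0.2 if total_objects < 5000 else 0.1))
        schedule = []
        # The validation set is itself collected in manual batches of at most batch_size.
        left = validation
        while left > 0:
            step = min(batch_size, left)
            schedule.append(("manual (validation)", step))
            left -= step
        # Then manual training batches alternate with active-learning passes.
        remaining = total_objects - validation
        while remaining > 0:
            step = min(batch_size, remaining)
            schedule.append(("manual (training)", step))
            remaining -= step
            if remaining > 0:
                # Active learning may label zero, some, or all remaining objects;
                # here we assume it labels none, as in step #4 of Example #1.
                schedule.append(("active learning pass", 0))
        return schedule

    print(batch_schedule(1758))                  # Example #1: validation 351, then 1,000, then 407
    print(batch_schedule(1758, batch_size=250))  # Example #2: 250 + 101 validation, then 250-object batches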

Hopefully these examples illustrate the workflow and help you understand the process a little better. Since your dataset consists of 1,758 objects, the upper bound on the number of automated labels that can be supplied is 407 objects (assuming you use the default MaxConcurrentTaskCount).

Ultimately, 1,758 objects is still a relatively small dataset; we typically recommend at least 5,000 objects to see meaningful results [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html]. Without knowing any other details of your labeling job, it's difficult to say why it didn't produce more automated annotations. A useful starting point would be to inspect the annotations you received and to evaluate the quality of the model that was trained during the Ground Truth labeling job.
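For example, a quick way to see how many labels were produced by humans versus the model, and to find the model that Ground Truth trained, is DescribeLabelingJob; a sketch with boto3 follows (the job name is a placeholder):

    import boto3

    sm = boto3.client("sagemaker")

    # Placeholder job name; use the name of your actual labeling job.
    resp = sm.describe_labeling_job(LabelingJobName="my-active-learning-job")

    counters = resp["LabelCounters"]
    print("Human labeled:  ", counters["HumanLabeled"])
    print("Machine labeled:", counters["MachineLabeled"])
    print("Unlabeled:      ", counters["Unlabeled"])

    # If automated labeling ran, the output also points at the trained model,
    # which you can evaluate or reuse as the starting model for a follow-up job.
    output = resp.get("LabelingJobOutput", {})
    print("Output manifest:", output.get("OutputDatasetS3Uri"))
    print("Final model ARN:", output.get("FinalActiveLearningModelArn"))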

Best regards from AWS!

Upvotes: 4
