To set up auto scaling for your model, you can use the SageMaker console, the AWS Command Line Interface (AWS CLI), or an AWS SDK through the Application Auto Scaling API. With the CLI or API, the process involves registering the model's endpoint variant as a scalable target, defining a scaling policy, and applying it. If you opt for the SageMaker console, navigate to Endpoints under Inference in the navigation pane, choose your model's endpoint, and then choose the variant name to configure auto scaling.
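The first CLI/API step, registering the variant as a scalable target, can be sketched as an Application Auto Scaling request payload. The endpoint name `my-endpoint`, variant name `AllTraffic`, and capacity values below are hypothetical placeholders; the field names follow the `RegisterScalableTarget` API.

```python
# Sketch: build the RegisterScalableTarget payload for a SageMaker endpoint
# variant. Endpoint/variant names and capacities are illustrative assumptions.

def build_scalable_target(endpoint_name, variant_name,
                          min_capacity=1, max_capacity=4):
    """Build the request payload that registers a variant for auto scaling."""
    return {
        "ServiceNamespace": "sagemaker",
        # Resource IDs for SageMaker take the form endpoint/<name>/variant/<name>
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }

request = build_scalable_target("my-endpoint", "AllTraffic")
# With boto3 installed and AWS credentials configured, you would apply it with:
# boto3.client("application-autoscaling").register_scalable_target(**request)
```

Building the payload separately from the call makes it easy to inspect or log the exact limits before they take effect.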
Let’s now dive into the intricacies of scaling policies.
Auto scaling is driven by scaling policies, which determine how instances are added or removed in response to varying workloads. Two options are at your disposal: target tracking and step scaling policies.
Target Tracking Scaling Policies: Our recommendation is to leverage target tracking scaling policies. Here, you select a CloudWatch metric and set a target value. Auto scaling takes care of creating and managing CloudWatch alarms, adjusting the number of instances to maintain the metric close to the specified target value. For instance, a scaling policy targeting the InvocationsPerInstance metric with a target value of 70 ensures the metric hovers around that value.
Step Scaling Policies: Step scaling is for advanced configurations, allowing you to specify how many instances to add or remove depending on how far a metric breaches its alarm threshold. However, for simplicity and full automation, target tracking scaling is preferred. Note that step scaling is managed exclusively through the AWS CLI or Application Auto Scaling API.
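A step scaling policy can be sketched as the following request payload, using the `StepScalingPolicyConfiguration` shape from the Application Auto Scaling API. The policy name, resource ID, and step boundaries are hypothetical; the helper function below only illustrates how step intervals map a metric breach to a capacity change.

```python
# Sketch: a StepScaling policy payload with two steps. All names and numbers
# are illustrative assumptions, not values taken from a real deployment.
step_policy = {
    "PolicyName": "demo-step-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        # Intervals are offsets from the CloudWatch alarm threshold.
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0.0,
             "MetricIntervalUpperBound": 20.0,
             "ScalingAdjustment": 1},   # small breach: add 1 instance
            {"MetricIntervalLowerBound": 20.0,
             "ScalingAdjustment": 2},   # large breach: add 2 instances
        ],
    },
}

def adjustment_for(config, breach):
    """Return the capacity change for a given breach above the threshold."""
    for step in config["StepAdjustments"]:
        lo = step.get("MetricIntervalLowerBound", float("-inf"))
        hi = step.get("MetricIntervalUpperBound", float("inf"))
        if lo <= breach < hi:
            return step["ScalingAdjustment"]
    return 0
```

For example, a breach of 10 above the threshold falls in the first interval and adds one instance, while a breach of 25 falls in the second and adds two.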
Creating a target tracking scaling policy involves specifying the metric, such as the average number of invocations per instance, and the target value, for example, 70 invocations per instance per minute. You have the flexibility to create target tracking scaling policies based on predefined or custom metrics. Cooldown periods, which prevent rapid capacity fluctuations, can also be configured optionally.
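The target tracking policy described above can be sketched as a `PutScalingPolicy` request payload. The endpoint and variant names are hypothetical placeholders; `SageMakerVariantInvocationsPerInstance` is the predefined metric corresponding to average invocations per instance per minute, and the cooldown values shown are optional.

```python
# Sketch: a TargetTrackingScaling policy payload targeting 70 invocations per
# instance per minute. Endpoint/variant names are illustrative assumptions.
tt_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Auto scaling adds/removes instances to keep this metric near 70.
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        # Optional cooldowns (seconds) to damp rapid capacity fluctuations.
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
}
# With boto3 and credentials configured, you would apply it with:
# boto3.client("application-autoscaling").put_scaling_policy(**tt_policy)
```

A custom CloudWatch metric could be substituted by replacing `PredefinedMetricSpecification` with a `CustomizedMetricSpecification` block.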
Scheduled actions enable scaling activities at specific times, either as a one-time event or on a recurring schedule. These actions can work in tandem with your scaling policy, allowing dynamic decisions based on changing workloads. Scheduled scaling is managed exclusively through the AWS CLI or Application Auto Scaling API.
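A scheduled action can be sketched as a `PutScheduledAction` request payload. The action name, endpoint, variant, capacities, and cron expression below are hypothetical; the schedule uses the six-field cron format accepted by Application Auto Scaling.

```python
# Sketch: a scheduled action that raises capacity limits every morning at
# 08:00 UTC. All names and values are illustrative assumptions.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ScheduledActionName": "business-hours-scale-out",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    # cron(minute hour day-of-month month day-of-week year)
    "Schedule": "cron(0 8 * * ? *)",
    # At the scheduled time, the scalable target's limits are updated;
    # any attached scaling policy then operates within the new limits.
    "ScalableTargetAction": {"MinCapacity": 2, "MaxCapacity": 10},
}
# With boto3 and credentials configured, you would apply it with:
# boto3.client("application-autoscaling").put_scheduled_action(**scheduled_action)
```

Because the action adjusts only the minimum and maximum limits, it composes naturally with a target tracking policy, which continues to make dynamic decisions within the new bounds.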
Before creating a scaling policy, it's essential to set minimum and maximum scaling limits. The minimum value, which must be at least 1, is the fewest instances kept running, while the maximum value is the upper cap. SageMaker auto scaling adheres to these limits and automatically scales in to the specified minimum when traffic drops to zero.
You have three options to specify these limits: the SageMaker console, the AWS CLI, or the Application Auto Scaling API.
The cooldown period prevents over-scaling during scale-in or scale-out activities by delaying subsequent scaling actions until the period expires, safeguarding against rapid capacity fluctuations. You can configure the cooldown period within your scaling policy.
If not specified, the default cooldown period is 300 seconds for both scale-in and scale-out. Adjust this value based on your model's traffic characteristics: consider increasing it for workloads with frequent spikes or when multiple scaling policies are attached, and decreasing it if instances need to be added swiftly.
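The tuning advice above can be expressed as asymmetric cooldowns inside the target tracking configuration. The values here are a hypothetical choice for a spiky workload, not recommended defaults: scale out quickly, scale in conservatively.

```python
# Sketch: asymmetric cooldowns (in seconds) for a hypothetical spiky workload.
# Both fields are part of TargetTrackingScalingPolicyConfiguration; 300 is the
# default for each when omitted.
DEFAULT_COOLDOWN = 300

spiky_traffic_cooldowns = {
    "ScaleOutCooldown": 60,   # add instances quickly when traffic spikes
    "ScaleInCooldown": 600,   # remove instances slowly to ride out bursts
}

# Scale-out reacts faster than the default; scale-in reacts slower.
faster_out = spiky_traffic_cooldowns["ScaleOutCooldown"] < DEFAULT_COOLDOWN
slower_in = spiky_traffic_cooldowns["ScaleInCooldown"] > DEFAULT_COOLDOWN
```

These keys would be merged into the policy's `TargetTrackingScalingPolicyConfiguration` before calling the Application Auto Scaling API.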
As you optimize your model's scalability, keep these configurations in mind to ensure a seamless and cost-effective experience. In the next section, you will explore the different ways of securing your Amazon SageMaker notebooks.