Important note
To remember this easily, you can think of t for Tiny, m for Medium, c for Compute, r for RAM, and p and g for GPU. The CPU-based instance families are t, m, c, and r. The GPU-based instance families are p and g.
There is no rule of thumb for determining the instance type you require. The right choice depends on the size of the data, the complexity of the network, the ML algorithm in question, and several other factors such as time and cost. Asking the right questions will help you save money and keep your project cost-effective.
If the deciding factor is instance size, then the first step is to classify the problem as one for CPUs or one for GPUs. Once that is done, consider whether the workload could run across multiple GPUs or multiple CPUs; this answers the question of distributed training and also settles the instance count. If the job is compute intensive, it is wise to check the memory requirements too.
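As an illustration, here is a minimal sketch of how these choices surface when you configure a training job with the SageMaker Python SDK. The training script, IAM role, S3 path, and the specific instance choices are hypothetical placeholders, not recommendations: the point is that instance_type expresses the CPU/GPU decision and instance_count expresses the distributed training decision.

```python
# A minimal sketch using the SageMaker Python SDK (v2); the script, role,
# and S3 paths below are hypothetical placeholders, not recommendations.
from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

estimator = PyTorch(
    entry_point="train.py",          # hypothetical training script
    framework_version="1.13",
    py_version="py39",
    role=role,
    instance_type="ml.p3.2xlarge",   # a GPU family for a compute-intensive network;
                                     # an ml.c5.* or ml.m5.* type would express a CPU choice
    instance_count=2,                # more than one instance requests distributed training
                                     # and settles the instance count question
    volume_size=100,                 # size this to your data
)

# estimator.fit({"training": "s3://my-bucket/train/"})  # hypothetical S3 location
```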
The next deciding factor is the instance family. The right question to ask here is whether the chosen instance family is optimized for both time and cost. In the previous step, you figured out whether the problem is best solved by a CPU or a GPU, and this narrows down the selection process. Now, let's learn about inference jobs.
The majority of the cost and complexity of ML in production comes from inference. Usually, inference runs on a single input in real time, so inference jobs are typically less compute- and memory-intensive than training. However, they have to be highly available because they run all the time, serving end-user requests or sitting inside a wider application.
You can choose any of the instance types that you have learned about so far based on the given workload. In addition, AWS offers the Inf1 instance family and Elastic Inference for inference workloads. Elastic Inference allows you to attach a fraction of a GPU to any CPU instance.
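As a sketch of where the inference sizing decision lands, the following snippet assumes a model object already created with the SageMaker Python SDK and deploys it to a real-time endpoint. The instance choices are placeholders; note that deploying to an ml.inf1.* type additionally requires the model to be compiled for AWS Inferentia, which is not shown here.

```python
# A minimal sketch assuming `model` is an existing sagemaker.model.Model;
# the instance choices are placeholders, not recommendations.
predictor = model.deploy(
    initial_instance_count=2,         # inference fleets are sized for availability
    instance_type="ml.c5.xlarge",     # a CPU type is often enough for real-time inference
    # instance_type="ml.inf1.xlarge", # Inf1 needs a model compiled for AWS Inferentia
)

# result = predictor.predict(payload)  # a single real-time request
```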
Let's look at an example where an application is integrated with inference jobs. In this case, the CPU and memory requirements of the application differ from those of the inference jobs, so you need to choose the right instance type and size for each. In such scenarios, it is good to keep your application fleets and inference fleets separate, although this adds some management overhead. If that overhead is a problem for your use case, choose Elastic Inference, where the application and the inference jobs can be colocated. This means that you can host multiple models on the same fleet, load these different models onto different accelerators in memory, and serve concurrent requests.
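The following is a minimal sketch of attaching an Elastic Inference accelerator when deploying a SageMaker model, assuming Elastic Inference is available in your Region and the model's container supports it. The instance and accelerator types are placeholders.

```python
# A minimal sketch assuming `model` is an existing sagemaker.model.Model and
# that its container supports Elastic Inference; values are placeholders.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",        # CPU instance hosting the model
    accelerator_type="ml.eia2.medium",  # fraction of a GPU attached for inference
)
```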
It is always recommended that you run some experiments in a lower environment before deciding on your instance type and family for the production environment. In production, you also need to manage the scaling configuration of your Amazon SageMaker hosted models, which you will learn about in the next section.