Using other types of data stores – AWS Services for Data Storage – MLS-C01 Study Guide

Using other types of data stores

Elastic Block Store (EBS) is used to create volumes in an Availability Zone. A volume can only be attached to an EC2 instance in the same Availability Zone. Amazon EBS provides both Solid-State Drive (SSD) and Hard Disk Drive (HDD) volume types. For SSD-based volumes, the dominant performance attribute is Input/Output Operations Per Second (IOPS); for HDD-based volumes, it is throughput, which is generally measured in MiB/s. You can choose between volume types such as General Purpose SSD (gp2), Provisioned IOPS SSD (io1), or Throughput Optimized HDD (st1), depending on your requirements. Provisioned IOPS volumes are often used for high-performance workloads, such as deep learning training, where low latency and high throughput are critical. Table 2.1 provides an overview of the different volume types and their use cases:

| Volume type | Use cases |
| --- | --- |
| General Purpose SSD (gp2) | Useful for maintaining a balance between price and performance. Good for most workloads, system boot volumes, and dev and test environments |
| Provisioned IOPS SSD (io2, io1) | Useful for mission-critical, high-throughput, or low-latency workloads. For example, I/O-intensive database workloads such as MongoDB, Cassandra, and Oracle |
| Throughput Optimized HDD (st1) | Useful for frequently accessed, throughput-intensive workloads. For example, big data processing, data warehouses, and log processing |
| Cold HDD (sc1) | Useful for less frequently accessed workloads |

Table 2.1 – Different volumes and their use cases
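The use cases in Table 2.1 can be read as a simple decision rule. The following is a minimal sketch of that reading, not an official AWS selection algorithm; the `Workload` class, its fields, and the `suggest_volume_type` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload's storage needs."""
    latency_critical: bool = False      # e.g. I/O-intensive databases
    throughput_intensive: bool = False  # e.g. big data or log processing
    frequently_accessed: bool = True    # False for rarely read data


def suggest_volume_type(workload: Workload) -> str:
    """Suggest an EBS volume type by following the use cases in Table 2.1."""
    if workload.latency_critical:
        return "io2"  # Provisioned IOPS SSD
    if not workload.frequently_accessed:
        return "sc1"  # Cold HDD
    if workload.throughput_intensive:
        return "st1"  # Throughput Optimized HDD
    return "gp2"      # General Purpose SSD, the balanced default
```

For example, `suggest_volume_type(Workload(throughput_intensive=True))` suggests a Throughput Optimized HDD volume, while a workload with no special requirements falls through to General Purpose SSD.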

EBS is designed to be resilient within an Availability Zone (AZ). If, for some reason, an AZ fails, then the volume cannot be accessed. To guard against such scenarios, snapshots can be created from EBS volumes and are stored in S3. Once a snapshot arrives in S3, the data in the snapshot becomes Region-resilient. The first snapshot is a full copy of the data on the volume; from then onward, snapshots are incremental. Snapshots can be used to clone a volume. As the snapshot is stored in S3, a volume can be cloned into any AZ in that Region. Snapshots can also be copied between Regions, and volumes can be created from them during disaster recovery. Even after an EC2 instance is stopped or terminated, its data can be restored to a new EBS volume from backed-up snapshots.
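The snapshot-then-clone flow described above might look like the following with the boto3 EC2 client. This is a sketch under assumptions: the function name `clone_volume_to_az` is hypothetical, the client is passed in by the caller (for example, one created with `boto3.client("ec2")`), and error handling is omitted.

```python
def clone_volume_to_az(ec2, volume_id: str, target_az: str) -> str:
    """Snapshot a volume, then create a copy of it in another AZ of the same Region.

    `ec2` is assumed to be a boto3 EC2 client supplied by the caller.
    Returns the ID of the newly created volume.
    """
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Clone source for {volume_id}",
    )
    snapshot_id = snapshot["SnapshotId"]

    # Wait until the snapshot has completed before creating a volume from it.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # The snapshot is Region-resilient, so any AZ in the Region can be targeted.
    volume = ec2.create_volume(
        AvailabilityZone=target_az,
        SnapshotId=snapshot_id,
    )
    return volume["VolumeId"]
```

Passing the client as an argument keeps the function free of credentials and Region configuration, and makes it easy to exercise with a stub client.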

With EBS Multi-Attach, a single Provisioned IOPS SSD (io1 or io2) volume can be attached to multiple EC2 instances in the same Availability Zone for concurrent access. If a use case requires multiple instances to access the training dataset simultaneously (as in distributed training scenarios), EBS Multi-Attach can provide a solution with improved performance and scalability.
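As an illustration, the request parameters for a Multi-Attach volume could be assembled as follows. This is a sketch, assuming the result is later passed to the boto3 EC2 client's `create_volume` call; the helper names `multi_attach_volume_params` and `can_attach` are hypothetical.

```python
# Multi-Attach is available on Provisioned IOPS SSD volumes.
MULTI_ATTACH_TYPES = {"io1", "io2"}


def multi_attach_volume_params(az: str, size_gib: int, iops: int,
                               volume_type: str = "io2") -> dict:
    """Build keyword arguments for create_volume with Multi-Attach enabled."""
    if volume_type not in MULTI_ATTACH_TYPES:
        raise ValueError(f"Multi-Attach requires io1 or io2, got {volume_type!r}")
    return {
        "AvailabilityZone": az,
        "Size": size_gib,
        "VolumeType": volume_type,
        "Iops": iops,
        "MultiAttachEnabled": True,
    }


def can_attach(volume_az: str, instance_az: str) -> bool:
    """A volume can only be attached to an instance in the same AZ."""
    return volume_az == instance_az
```

A caller would then run something like `ec2.create_volume(**multi_attach_volume_params("us-east-1a", 100, 3000))`, where the AZ, size, and IOPS values are placeholders.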

EBS volumes can be encrypted using AWS Key Management Service (KMS), which manages the customer master key (CMK). EBS can use either the AWS-managed CMK for EBS or a customer-managed CMK. The CMK is used by EBS when an encrypted volume is created: it is used to create an encrypted data encryption key (DEK), which is stored with the volume on the physical disk. This DEK can only be decrypted using KMS, assuming the requesting entity has permission to decrypt. When a snapshot is created from the encrypted volume, the snapshot is encrypted with the same DEK. Any volume created from this snapshot also uses that DEK.
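EBS performs the key handling internally, but the same CMK-and-DEK flow can be mirrored with the public KMS API. The sketch below is illustrative only, not a description of how EBS is implemented; it assumes `kms` is a boto3 KMS client supplied by the caller, and the function names are hypothetical.

```python
def create_encrypted_dek(kms, cmk_id: str) -> bytes:
    """Ask KMS to generate a DEK under the given CMK.

    Only the encrypted form of the DEK is returned, which is the form
    that would be stored alongside the data it protects.
    """
    response = kms.generate_data_key_without_plaintext(
        KeyId=cmk_id,
        KeySpec="AES_256",
    )
    return response["CiphertextBlob"]


def decrypt_dek(kms, encrypted_dek: bytes) -> bytes:
    """Recover the plaintext DEK.

    KMS performs the decryption, so this succeeds only if the caller
    is permitted to decrypt with the CMK that protects the DEK.
    """
    response = kms.decrypt(CiphertextBlob=encrypted_dek)
    return response["Plaintext"]
```

The plaintext DEK never needs to be stored: it is requested from KMS when needed and should be held in memory only while the data is in use.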

Instance store volumes are block storage devices physically attached to the host on which the EC2 instance runs. They provide the highest performance, as the ephemeral storage comes from the same host where the instance is launched. EBS volumes can be attached to an instance at any time, but instance store volumes must be attached at launch; they cannot be added once the instance is running. If there is an issue on the underlying host of an EC2 instance, the instance will be launched on another host with a new instance store volume, and the data on the earlier instance store (ephemeral storage) will be lost. The size and capabilities of the attached volumes depend on the instance type and can be found in more detail here: https://aws.amazon.com/ec2/instance-types/.

Elastic File System (EFS) provides a network-based filesystem that can be mounted within Linux EC2 instances and used by multiple instances at once. It is an implementation of NFSv4. EFS offers two performance modes, General Purpose and Max I/O (for scientific analysis or parallel computing), and two throughput modes, Bursting and Provisioned. Shared access makes it well suited to scenarios where multiple instances need to train on large datasets or share model artifacts. With EFS, you can store training datasets, pre-trained models, and other data centrally, ensuring consistency and reducing data duplication. EFS also provides high-throughput, low-latency access, enabling efficient data access during training and inference. By using EFS with SageMaker, machine learning developers can scale their workloads, collaborate on shared data, and speed up model development and training.
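The performance and throughput modes map onto parameters of the EFS `create_file_system` API. The following sketch builds those parameters for use with a boto3 EFS client; the helper name `efs_create_params` is hypothetical, and only the modes mentioned in this section are accepted.

```python
from typing import Optional

PERFORMANCE_MODES = {"generalPurpose", "maxIO"}
THROUGHPUT_MODES = {"bursting", "provisioned"}


def efs_create_params(performance_mode: str = "generalPurpose",
                      throughput_mode: str = "bursting",
                      provisioned_mibps: Optional[float] = None,
                      encrypted: bool = True) -> dict:
    """Build keyword arguments for the EFS create_file_system call."""
    if performance_mode not in PERFORMANCE_MODES:
        raise ValueError(f"Unknown performance mode: {performance_mode!r}")
    if throughput_mode not in THROUGHPUT_MODES:
        raise ValueError(f"Unknown throughput mode: {throughput_mode!r}")

    params = {
        "PerformanceMode": performance_mode,
        "ThroughputMode": throughput_mode,
        "Encrypted": encrypted,
    }
    if throughput_mode == "provisioned":
        # Provisioned mode needs an explicit throughput figure in MiB/s.
        if provisioned_mibps is None:
            raise ValueError("provisioned mode requires provisioned_mibps")
        params["ProvisionedThroughputInMibps"] = provisioned_mibps
    return params
```

A caller would pass the result to the client, for example `efs.create_file_system(**efs_create_params("maxIO"))`, where `efs` is assumed to come from `boto3.client("efs")`.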

As you know, in the case of instance stores, the data is volatile. As soon as the instance is lost, the data is lost from the instance store. That is not the case for EFS. EFS is separate from the EC2 instance storage. EFS is a file store and is accessed by multiple EC2 instances via mount targets inside a VPC. On-premises systems can access EFS storage via hybrid networking to the VPC, such as VPN or Direct Connect. EFS also supports two types of storage classes: Standard and Infrequent Access. Standard is used for frequently accessed data. Infrequent Access is the cost-effective storage class for long-lived, less frequently accessed data. Lifecycle policies can be used for the transition of data between storage classes. EFS offers a pay-as-you-go pricing model, where you only pay for the storage capacity you use. It eliminates the need to provision and manage separate storage volumes for each instance, reducing storage costs and simplifying storage management for your machine learning workloads.
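A lifecycle policy that moves data to the Infrequent Access storage class can be set through the EFS `put_lifecycle_configuration` API. This is a minimal sketch, assuming `efs` is a boto3 EFS client passed in by the caller; the function name is hypothetical and the set of allowed values below is only a subset of what the API accepts.

```python
# A subset of the transition values accepted by the API.
IA_TRANSITIONS = {
    "AFTER_7_DAYS",
    "AFTER_14_DAYS",
    "AFTER_30_DAYS",
    "AFTER_60_DAYS",
    "AFTER_90_DAYS",
}


def enable_infrequent_access(efs, file_system_id: str,
                             after: str = "AFTER_30_DAYS") -> dict:
    """Move files not accessed for the given period to Infrequent Access."""
    if after not in IA_TRANSITIONS:
        raise ValueError(f"Unsupported transition in this sketch: {after!r}")
    return efs.put_lifecycle_configuration(
        FileSystemId=file_system_id,
        LifecyclePolicies=[{"TransitionToIA": after}],
    )
```

Choosing a longer period keeps data in the Standard class for longer, which suits datasets that are read repeatedly across training runs.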

Important note

An instance store is preferred when maximum I/O performance is required and the data is replaceable and temporary.
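The trade-offs between the three storage options in this section can be summarized as a small decision helper. This is one simplified reading of the guidance above, not an AWS recommendation engine; the function name `choose_storage` and its parameters are hypothetical.

```python
def choose_storage(shared_across_instances: bool,
                   data_is_temporary: bool,
                   needs_max_io: bool) -> str:
    """Pick a storage option using the rules of thumb from this section."""
    if shared_across_instances:
        # A network filesystem that many instances can mount at once.
        return "EFS"
    if data_is_temporary and needs_max_io:
        # Fastest option, but data is lost if the instance moves hosts.
        return "instance store"
    # Persistent block storage tied to a single Availability Zone.
    return "EBS"
```

For instance, scratch space for intermediate training data that can be regenerated maps to an instance store, while a dataset shared by several training instances maps to EFS.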