Training Data Location and Formats – Amazon SageMaker Modeling – MLS-C01 Study Guide

Training Data Location and Formats

Before you set up an Amazon SageMaker training job, you need to know where the training data can be stored and how the training container reads it. This section covers the supported storage options, the input modes, and the cases where each one fits best.

First, look at the supported data storage options:

  • Amazon Simple Storage Service (Amazon S3):
    • Overview: SageMaker can read training datasets directly from Amazon S3, which provides reliable and scalable object storage.
    • Usage Example: You can configure your dataset using an Amazon S3 prefix, manifest file, or augmented manifest file.
  • Amazon Elastic File System (Amazon EFS):
    • Overview: SageMaker extends its support to Amazon EFS, facilitating file system access to the dataset.
    • Usage Example: The data must already be in Amazon EFS before you start the training job.
  • Amazon FSx for Lustre:
    • Overview: SageMaker mounts the FSx for Lustre file system to the training instance, which gives high-throughput, low-latency file retrieval.
    • Usage Example: FSx for Lustre can scale seamlessly, providing a performant option for your training data.
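
The S3 options above can be sketched as plain Python dictionaries shaped like one entry of the `InputDataConfig` list in a `CreateTrainingJob` request. This is a minimal sketch, not a complete job definition: the helper names `s3_channel` and `build_manifest`, the bucket `example-bucket`, and the object keys are hypothetical. Augmented manifest files need extra settings and are left out here.

```python
import json


def s3_channel(name, s3_uri, data_type="S3Prefix"):
    """Build one training channel that reads from Amazon S3.

    data_type is "S3Prefix" (every object under the prefix) or
    "ManifestFile" (s3_uri points at a manifest listing the objects).
    """
    if data_type not in ("S3Prefix", "ManifestFile"):
        raise ValueError(f"unsupported data_type in this sketch: {data_type}")
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": data_type,
                "S3Uri": s3_uri,
            }
        },
    }


def build_manifest(prefix, relative_keys):
    """Return the JSON body of a manifest: a prefix entry, then relative keys."""
    return json.dumps([{"prefix": prefix}, *relative_keys])


# Hypothetical bucket and keys, for illustration only.
prefix_channel = s3_channel("train", "s3://example-bucket/train/")
manifest_channel = s3_channel(
    "train", "s3://example-bucket/manifests/train.manifest", "ManifestFile"
)
manifest_body = build_manifest(
    "s3://example-bucket/train/", ["part-0001.csv", "part-0002.csv"]
)
```

A prefix is the simplest choice when every object under one path belongs to the dataset. A manifest helps when only a selected subset of objects should be used.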

Here are the input modes for data access:

  • File Mode:
    • Overview: The default input mode. SageMaker downloads the entire dataset to the Docker container before training starts, so the training storage volume must be large enough to hold it.
    • Usage Example: Compatible with SageMaker local mode and supports sharding for distributed training.
  • Fast File Mode:
    • Overview: Combining file system access with the efficiency of pipe mode, fast file mode identifies data files at the start but delays the download until necessary.
    • Usage Example: Shortens training startup time, which helps most when the dataset is large.
  • Pipe Mode:
    • Overview: Streams data directly from an Amazon S3 data source, providing faster start times and better throughput than file mode.
    • Usage Example: Historically used, but largely replaced by the simpler-to-use fast file mode.
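
As a rough illustration of the three input modes, the sketch below builds an S3 channel dictionary and sets its `InputMode` field to `File`, `FastFile`, or `Pipe`, the values used by the `CreateTrainingJob` API. The helper name `s3_channel_with_mode`, its `sharded` flag, and the bucket name are hypothetical. The `sharded` option maps to the `ShardedByS3Key` distribution type, which gives each training instance a subset of the objects.

```python
VALID_INPUT_MODES = ("File", "FastFile", "Pipe")


def s3_channel_with_mode(name, s3_uri, input_mode="File", sharded=False):
    """Build an S3 prefix channel with an explicit input mode.

    input_mode: "File" downloads everything first, "FastFile" fetches
    objects on demand, "Pipe" streams the data.
    sharded: when True, split the S3 objects across training instances
    instead of giving every instance a full copy.
    """
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {VALID_INPUT_MODES}")
    distribution = "ShardedByS3Key" if sharded else "FullyReplicated"
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": distribution,
            }
        },
    }


# Hypothetical bucket: a large dataset read lazily and sharded across instances.
channel = s3_channel_with_mode(
    "train", "s3://example-bucket/train/", input_mode="FastFile", sharded=True
)
```

Fast file mode is used in this example because the training code can keep reading ordinary file paths while avoiding the up-front download that file mode performs.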

Lastly, look at the Amazon S3 Express One Zone storage class and the file system options:

  • Amazon S3 Express One Zone:
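    • Overview: A high-performance Amazon S3 storage class that keeps data in a single Availability Zone. Placing the data in the same Availability Zone as the training compute can improve performance and lower costs.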
    • Usage Example: Supports file mode, fast file mode, and pipe mode for SageMaker model training.
  • Amazon EFS and Amazon FSx for Lustre:
    • Overview: SageMaker supports both Amazon EFS and Amazon FSx for Lustre, offering flexibility in choosing the right storage solution for your training data.
    • Usage Example: Mounting the file systems to the training instance ensures seamless access during training.
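
For the file system options, a channel points at a `FileSystemDataSource` instead of an S3 location. The sketch below shows that shape under stated assumptions: the helper name `file_system_channel`, the file system IDs, and the directory paths are hypothetical placeholders. A training job that mounts a file system also needs a VPC configuration that can reach it, which is not shown here.

```python
def file_system_channel(name, file_system_id, fs_type, directory_path,
                        access_mode="ro"):
    """Build one training channel backed by Amazon EFS or FSx for Lustre.

    fs_type is "EFS" or "FSxLustre"; access_mode is "ro" (read-only)
    or "rw" (read-write).
    """
    if fs_type not in ("EFS", "FSxLustre"):
        raise ValueError("fs_type must be 'EFS' or 'FSxLustre'")
    if access_mode not in ("ro", "rw"):
        raise ValueError("access_mode must be 'ro' or 'rw'")
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": fs_type,
                "DirectoryPath": directory_path,
                "FileSystemAccessMode": access_mode,
            }
        },
    }


# Placeholder IDs and paths, for illustration only.
efs_channel = file_system_channel(
    "train", "fs-0123456789abcdef0", "EFS", "/datasets/train"
)
fsx_channel = file_system_channel(
    "train", "fs-0fedcba9876543210", "FSxLustre", "/fsx/datasets/train"
)
```

Read-only access is the default in this sketch because training jobs usually only need to read the dataset.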

Knowing the data storage and input mode options for Amazon SageMaker training jobs lets you match your setup to your dataset size, startup time, and throughput requirements. The upcoming sections cover more aspects of Amazon SageMaker and its role in machine learning workflows. Let's put this knowledge to work in the next section.