AWS Services for Data Storage – MLS-C01 Study Guide

AWS provides a wide range of services for storing your data safely and securely. The storage options available on AWS include block storage, file storage, and object storage. Managing on-premises data storage is expensive because of the upfront investment in hardware, the administrative overhead, and the effort of managing system upgrades. With AWS storage services, you pay only for what you use, and you don’t have to manage the hardware. You will learn about the storage classes offered by Amazon S3, which provide intelligent access to data and help reduce costs. You can expect questions in the exam on storage classes. As you continue through this chapter, you will also cover single-AZ and multi-AZ instances, along with the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) concepts, in the context of Amazon RDS.

This chapter covers storing your data securely for further analytical purposes in the following sections:

  • Storing data on Amazon S3
  • Controlling access on S3 buckets and objects
  • Protecting data on Amazon S3
  • Securing S3 objects at rest and in transit
  • Using other types of data stores
  • Relational Database Services (RDSes)
  • Managing failover in Amazon RDS
  • Taking automatic backup, RDS snapshots, and restore and read replicas
  • Writing to Amazon Aurora with multi-master capabilities
  • Storing columnar data on Amazon Redshift
  • Amazon DynamoDB for NoSQL databases as a service

Technical requirements

All you will need for this chapter is an AWS account and the AWS CLI configured. The steps to configure the AWS CLI for your account are explained in detail by Amazon here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.

You can download the code examples from GitHub, here: https://github.com/PacktPublishing/AWS-Certified-Machine-Learning-Specialty-MLS-C01-Certification-Guide-Second-Edition/tree/main/Chapter02.

Storing Data on Amazon S3

S3 is Amazon’s cloud-based object storage service, and it can be accessed from anywhere via the internet. It is an ideal storage option for large datasets. S3 is region-based: your data is stored in the region you choose and never leaves that region unless you configure it to do so. Within a region, data is replicated across that region’s availability zones, which makes S3 regionally resilient. If one availability zone in a region fails, the other availability zones continue to serve your requests. S3 can be accessed via the AWS console UI, the AWS CLI, AWS API requests, or standard HTTP methods.
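To make the region-based addressing and HTTP access described above more concrete, the following minimal sketch builds a virtual-hosted-style HTTPS URL for an object. The helper name s3_object_url, the bucket name example-ml-datasets, and the key are hypothetical and used only for illustration. A GET request to such a URL would succeed only if the object is publicly readable or the request is signed.

```python
from urllib.parse import quote


def s3_object_url(bucket: str, region: str, key: str) -> str:
    """Build a virtual-hosted-style HTTPS URL for an S3 object.

    The bucket name and region form the host; the object key forms the path.
    """
    # Percent-encode the key, but keep "/" so prefix-style keys stay readable.
    encoded_key = quote(key, safe="/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{encoded_key}"


# Hypothetical bucket and key, for illustration only.
url = s3_object_url("example-ml-datasets", "us-east-1", "raw/train data.csv")
print(url)
```

Note that the region appears in the host name, which reflects the fact that a bucket lives in one specific region.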

S3 has two main components: buckets and objects.

  • Buckets are created in a specific AWS region. Buckets can contain objects but cannot contain other buckets.
  • Objects have two main attributes: the key, which is the object’s name, and the value, which is the content being stored. The maximum size of an object is 5 TB. As per the Amazon S3 documentation (https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingObjects.html), objects also have a version ID, metadata, access control information, and sub-resources.
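The bucket and object model above can be sketched in Python as follows. This is an illustrative sketch rather than code from the chapter's repository: the helper put_and_get_object is hypothetical, and it assumes it is given a boto3 S3 client (for example, boto3.client("s3")) and a bucket that already exists. The metadata entry is likewise made up for the example.

```python
from typing import Any, Dict


def put_and_get_object(s3_client: Any, bucket: str, key: str, value: bytes) -> Dict[str, Any]:
    """Store `value` under `key` in `bucket`, then read it back.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # The key is the object's name; the value is the content being stored.
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=value,
        Metadata={"source": "study-guide-example"},  # hypothetical user-defined metadata
    )

    response = s3_client.get_object(Bucket=bucket, Key=key)
    return {
        "key": key,
        "value": response["Body"].read(),
        "metadata": response.get("Metadata", {}),
        # A version ID is only returned when versioning is enabled on the bucket.
        "version_id": response.get("VersionId"),
    }
```

With boto3 installed and the AWS CLI credentials configured, calling put_and_get_object(boto3.client("s3"), "example-ml-datasets", "raw/train.csv", b"x,y\n1,2\n") would write and read a small object, where example-ml-datasets stands in for a bucket you own. Passing the client in as a parameter keeps the function easy to exercise with a stub.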