S3 sync is an AWS CLI feature that can be a great option when you simply want to copy a large number of files from your production server to AWS. The s3 sync command compares the files in a source location with those in a destination and copies only the files that are new or have changed. The source and destination can be a local directory and an S3 bucket in either order, so the synchronization can run in either direction, although each invocation copies one way only. All you need is a single AWS CLI command.
For this example, let’s say you have a bucket called mylittlesyncbucket and five files named sync01.txt through sync05.txt in the local directory. Now you can run the S3 sync command:
aws s3 sync . s3://mylittlesyncbucket
The output should look similar to the following. Because the CLI uploads files in parallel, the order of the lines is not deterministic:
upload: .\sync03.txt to s3://mylittlesyncbucket/sync03.txt
upload: .\sync01.txt to s3://mylittlesyncbucket/sync01.txt
upload: .\sync04.txt to s3://mylittlesyncbucket/sync04.txt
upload: .\sync05.txt to s3://mylittlesyncbucket/sync05.txt
upload: .\sync02.txt to s3://mylittlesyncbucket/sync02.txt
When you log in to the AWS console, you can see the bucket has been synced with the local directory, as shown in Figure 6.3.
FIGURE 6.3 The S3 bucket has been synced
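If you add more files to the directory or change the content of existing files and then rerun the s3 sync command, only the new files and the files that have changed are copied over. By default, a local file counts as changed when its size differs from the object in the bucket or when its last-modified time is newer.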
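The selection behavior described above can be sketched in a few lines of Python. This is a simplified model of the default comparison for a local-to-S3 sync, not the CLI's own code; the names FileInfo and files_to_upload, and the sample sizes and timestamps, are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    size: int      # size in bytes
    mtime: float   # last-modified time, seconds since the epoch


def files_to_upload(local: dict[str, FileInfo], remote: dict[str, FileInfo]) -> list[str]:
    """Return the keys a local-to-S3 sync would upload, in sorted order."""
    selected = []
    for key, local_file in local.items():
        remote_obj = remote.get(key)
        if (
            remote_obj is None                       # not in the bucket yet
            or local_file.size != remote_obj.size    # size differs
            or local_file.mtime > remote_obj.mtime   # local copy is newer
        ):
            selected.append(key)
    return sorted(selected)


# Hypothetical state: sync01-05.txt were uploaded at t=100, then
# sync02.txt was edited locally and sync06.txt was added.
remote = {f"sync0{i}.txt": FileInfo(size=10, mtime=100.0) for i in range(1, 6)}
local = {key: FileInfo(size=10, mtime=90.0) for key in remote}
local["sync02.txt"] = FileInfo(size=25, mtime=200.0)
local["sync06.txt"] = FileInfo(size=10, mtime=200.0)

print(files_to_upload(local, remote))  # ['sync02.txt', 'sync06.txt']
```

The four untouched files are skipped because their size matches and the local copy is not newer than the object in the bucket.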
To ensure all file versions are stored on S3, you can enable versioning on the S3 bucket you are syncing to. If a production file becomes corrupted and is then synced, the corrupted copy is stored as a new version, and the earlier, intact versions remain available for recovery.
The S3 sync approach requires you to do a bit of scripting, and this can be difficult to manage at scale. To avoid having to create your own scripting, you can use the AWS DataSync service. DataSync can synchronize file systems in the production location with other file systems in the backup location or with S3. For an on-premises source, the service requires a DataSync agent that has access to the source file system. The agent uses a secure connection to the DataSync service and employs traffic optimization when transferring data. AWS states that DataSync can transfer data up to 10 times faster than open-source tools, which makes it a strong choice when syncing data from on-premises file systems to AWS.
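To maintain a record of all changes to an object in Amazon S3, you can enable bucket versioning. With versioning enabled, every write to an existing key (PUT, POST, or COPY) is stored as a separate version of the object under that key, and each version receives its own unique version identifier. The identifiers are opaque strings generated by S3, not incrementing numbers. A DELETE request that does not name a version does not remove any data; instead, S3 adds a delete marker that becomes the current version. Figure 6.4 demonstrates this process.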
FIGURE 6.4 S3 versioning
All S3 buckets are created without versioning. Once versioning is enabled, every version of every object is retained. Versioning cannot be switched off again after it has been enabled, but it can be suspended. While versioning is suspended, new writes are stored with the version identifier “null”, and each later write to the same key overwrites that null version. However, all versions created while versioning was enabled persist, as demonstrated in Figure 6.5.
FIGURE 6.5 S3 versioning suspended
Any request for the object in the versioned bucket returns the latest version; however, you can also retrieve a specific version of an object in the versioned bucket by specifying the version identifier.
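To make the enabled, suspended, and retrieval behavior concrete, here is a minimal in-memory sketch in Python. It is a toy model written for this explanation, not an AWS API: the class VersionedBucket and its methods are hypothetical, and uuid4 merely stands in for the opaque version identifiers that S3 generates.

```python
from __future__ import annotations

import uuid


class VersionedBucket:
    """Toy in-memory model of S3 versioning states; not an AWS API."""

    def __init__(self) -> None:
        self.status = "Unversioned"  # later "Enabled" or "Suspended"
        # key -> list of (version_id, body), newest last.
        # A body of None represents a delete marker.
        self._versions = {}

    def enable(self) -> None:
        self.status = "Enabled"

    def suspend(self) -> None:
        self.status = "Suspended"

    def _write(self, key: str, body: bytes | None) -> str:
        stack = self._versions.setdefault(key, [])
        if self.status == "Enabled":
            version_id = uuid.uuid4().hex  # opaque unique ID, not a counter
        else:
            version_id = "null"
            # Any existing null version is overwritten.
            stack[:] = [entry for entry in stack if entry[0] != "null"]
        stack.append((version_id, body))
        return version_id

    def put(self, key: str, body: bytes) -> str:
        return self._write(key, body)

    def delete(self, key: str) -> str | None:
        if self.status == "Unversioned":
            self._versions.pop(key, None)  # no history to keep
            return None
        return self._write(key, None)  # add a delete marker

    def get(self, key: str, version_id: str | None = None) -> bytes:
        stack = self._versions.get(key, [])
        if version_id is None:
            entry = stack[-1] if stack else None  # latest version
        else:
            entry = next((e for e in stack if e[0] == version_id), None)
        if entry is None or entry[1] is None:
            raise KeyError(key)
        return entry[1]

    def version_ids(self, key: str) -> list[str]:
        return [version_id for version_id, _ in self._versions.get(key, [])]


bucket = VersionedBucket()
bucket.enable()
v1 = bucket.put("sync01.txt", b"first")
v2 = bucket.put("sync01.txt", b"second")
print(bucket.get("sync01.txt"))      # latest version: b'second'
print(bucket.get("sync01.txt", v1))  # specific version: b'first'

bucket.suspend()
bucket.put("sync01.txt", b"third")   # stored with version ID "null"
bucket.put("sync01.txt", b"fourth")  # overwrites the null version
print(bucket.version_ids("sync01.txt"))  # the two older IDs, then "null"
```

After the suspension, the two versions written while versioning was enabled are still retrievable by their identifiers, while only the most recent null version survives.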
As discussed in the preceding chapter, objects and their older versions can also be managed with lifecycle rules. This means that you can either transition noncurrent versions to a cheaper S3 storage tier, such as an archive class, or delete them entirely after a set number of days.
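As a rough illustration, a lifecycle rule for older versions could be expressed as the following Python dictionary. The rule ID, the 30-day and 365-day thresholds, and the choice of the GLACIER storage class are assumptions picked for this sketch; the field names follow the S3 lifecycle configuration structure.

```python
import json

# Hypothetical rule ID and thresholds, chosen only for illustration.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-old-versions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: apply to every object
            # Move versions to an archive tier 30 days after they
            # stop being the current version.
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
            ],
            # Delete versions 365 days after they become noncurrent.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

A configuration of this shape is what the S3 PutBucketLifecycleConfiguration operation expects, so the printed JSON could be saved to a file and supplied to the AWS CLI or an SDK. Treat the values as placeholders to adjust for your own retention requirements.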