Storing and transforming real-time data using Kinesis Data Firehose

Many use cases require streamed data to be stored for later analytics. One way to meet this requirement is to write a Kinesis consumer that reads from the Kinesis data stream and stores the records in S3. This solution needs an instance or machine to run the code, with the required permissions to read from the stream and write to S3. Another option is to run a Lambda function that is triggered as records are written to the stream (via the PutRecord or PutRecords APIs) and that reads the data from the stream and stores it in an S3 bucket. A minimal sketch of this Lambda-based approach follows.
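
The following is a hedged sketch of such a Lambda consumer, assuming a Kinesis event source mapping on the stream; the bucket name and object key layout are placeholders, not part of the original text. It decodes each base64-encoded Kinesis record and writes it to S3 as a separate object:

```python
# Hypothetical Lambda handler triggered by a Kinesis data stream.
# It decodes each record and writes it to a placeholder S3 bucket.
import base64
import os
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("TARGET_BUCKET", "example-raw-data-bucket")  # placeholder


def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event
        payload = base64.b64decode(record["kinesis"]["data"])
        key = f"raw/{record['kinesis']['partitionKey']}/{uuid.uuid4()}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    return {"processed": len(event["Records"])}
```

Even with this approach, you still own the function code, its IAM role, error handling, and retries, which is exactly the operational load the managed service below removes.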

  • To make this easier, Amazon provides a separate service called Kinesis Data Firehose. It can be plugged directly into a Kinesis data stream and requires an IAM role with permissions to read from the stream and write to S3. As a fully managed service, it removes the burden of managing servers and code, and it can also load the streamed data into Amazon Redshift, Elasticsearch, and Splunk. Kinesis Data Firehose scales automatically to match the throughput of the incoming data (see the delivery stream sketch after this list).
  • Data can be transformed by an AWS Lambda function before it is delivered to the destination. If you want to build a raw data lake containing the untransformed data, you can enable source record backup to store the original records in another S3 bucket prior to transformation.
  • With the help of AWS KMS, data can be encrypted as it is delivered to the S3 bucket; this has to be enabled while creating the delivery stream. Data can also be compressed in supported formats such as GZIP, ZIP, or Snappy.
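
The following is a minimal, hedged sketch of how such a delivery stream could be created with boto3, tying the points above together. All ARNs, bucket names, the Lambda transform function, and the buffering values are placeholders for illustration only:

```python
# Hypothetical creation of a Firehose delivery stream that reads from a Kinesis
# data stream and delivers to S3 with Lambda transformation, source record
# backup, KMS encryption, and GZIP compression. All identifiers are placeholders.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="example-stream-to-s3",  # placeholder name
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-read-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-transformed-bucket",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
        # Encrypt delivered objects with a KMS key
        "EncryptionConfiguration": {
            "KMSEncryptionConfig": {
                "AWSKMSKeyARN": "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
            }
        },
        # Transform records with a Lambda function before delivery
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "Lambda",
                    "Parameters": [
                        {
                            "ParameterName": "LambdaArn",
                            "ParameterValue": "arn:aws:lambda:us-east-1:111122223333:function:transform-fn",
                        }
                    ],
                }
            ],
        },
        # Keep the untransformed source records in a separate bucket
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::example-raw-backup-bucket",
        },
    },
)
```

With a configuration along these lines, Firehose reads from the stream, invokes the Lambda transform, backs up the raw records to the second bucket, and delivers GZIP-compressed, KMS-encrypted objects to the destination bucket without any servers for you to manage.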

In the next section, you will learn about different AWS services used for ingesting data from on-premises servers to AWS.

Different ways of ingesting data from on-premises into AWS

With the increasing demand for data-driven use cases, managing data on on-premises servers has become difficult, and taking backups is not easy when you deal with huge amounts of data. Once this data lands in a data lake, it can be used to build deep neural networks, populate a data warehouse that extracts meaningful information, run analytics, and generate reports.

Now, if you look at the available options for migrating data into AWS, they come with their own challenges. For example, if you want to send data to S3, you have to write a few lines of code to upload it, and then manage both that code and the servers it runs on. You must ensure that the data travels over HTTPS and verify that each transfer completed successfully. This adds complexity, time, and effort to the process. To avoid such scenarios, AWS provides services that address these use cases by letting you design a hybrid infrastructure in which data is shared between on-premises data centers and AWS. You will learn about these in the following sections.
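
For contrast, the following is a hedged sketch of the do-it-yourself upload approach described above, assuming placeholder bucket, key, and file names; it shows the kind of code you would otherwise have to write, run, and monitor yourself:

```python
# Hypothetical "few lines of code" upload from an on-premises server to S3.
# Bucket, key, and local path are placeholders.
import boto3

s3 = boto3.client("s3")  # boto3 uses HTTPS endpoints by default

# You still have to run this somewhere, grant it IAM permissions,
# schedule it, and handle failures yourself.
s3.upload_file(
    Filename="/data/exports/sales-2023.csv",  # placeholder local path
    Bucket="example-migration-bucket",        # placeholder bucket
    Key="onprem/sales-2023.csv",
)

# Verify that the object actually arrived
s3.head_object(Bucket="example-migration-bucket", Key="onprem/sales-2023.csv")
```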