
Processing real-time data using Kinesis Data Streams

Kinesis is Amazon’s streaming service and can be scaled based on requirements. It offers a level of persistence, retaining data for 24 hours by default and optionally for up to 365 days. Kinesis Data Streams is used for large-scale data ingestion, analytics, and monitoring:

  • Multiple producers can write data to a Kinesis stream, and multiple consumers can read data from it. As an example, suppose a producer writes data to a Kinesis stream with the default retention period of 24 hours: data ingested at 05:00:00 A.M. today will be available in the stream until 04:59:59 A.M. tomorrow. The data is not available beyond that point, so ideally it should be consumed before it expires or, if it is critical, persisted elsewhere. The retention period can be extended to a maximum of 365 days at an extra cost.
  • Kinesis can be used for real-time analytics or dashboard visualization. A producer is essentially a piece of code that pushes data into the Kinesis stream; it can run on an EC2 instance, a Lambda function, an IoT device, an on-premises server, a mobile application or device, and so on.
  • Similarly, a consumer is a piece of code running on an EC2 instance, a Lambda function, or an on-premises server that connects to the Kinesis stream, reads the data, and acts on it. AWS provides triggers to invoke a Lambda consumer as soon as data arrives in the stream (a simple producer/consumer sketch follows this list).
  • Kinesis is scalable thanks to its shard architecture; a shard is the fundamental throughput unit of a Kinesis stream. What is a shard? A shard is a logical structure that partitions the data based on a partition key. Each shard supports a write capacity of 1 MB/sec (up to 1,000 PUT records per second) and a read capacity of 2 MB/sec. A stream created with three shards therefore provides 3 MB/sec of write throughput, 6 MB/sec of read throughput, and 3,000 PUT records per second. More shards mean higher performance at extra cost.
  • The data in a shard is stored as Kinesis data records, each of which can hold up to 1 MB of data. Records are distributed across the shards based on their partition key, and each record also has a sequence number, which Kinesis assigns when a putRecord or putRecords API operation is performed so that the record can be uniquely identified. The partition key is specified by the producer when adding data to the stream, and it determines which shard a record is routed to, balancing the load across the shards.
  • There are two ways to encrypt the data in a Kinesis stream: server-side encryption and client-side encryption. Client-side encryption is harder to implement and manage because the client has to handle the keys, encrypt the data before putting it into the stream, and decrypt it after reading it from the stream. With server-side encryption enabled via AWS KMS, the data is automatically encrypted when you put it into the stream and decrypted when you get it from the stream (see the stream setup sketch below).
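The following is a minimal sketch, using boto3, of how a stream with three shards might be created, its retention period extended, and server-side encryption enabled. The stream name, the 7-day retention value, and the use of the AWS-managed alias/aws/kinesis key are illustrative assumptions, and the code presumes AWS credentials with the necessary Kinesis and KMS permissions:

```python
import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "orders-stream"  # hypothetical stream name

# Create a provisioned-mode stream with three shards
# (3 MB/sec write, 6 MB/sec read, 3,000 PUT records per second).
kinesis.create_stream(StreamName=STREAM_NAME, ShardCount=3)

# Wait until the stream is ACTIVE before changing its settings.
kinesis.get_waiter("stream_exists").wait(StreamName=STREAM_NAME)

# Extend retention beyond the 24-hour default, here to 7 days
# (the maximum is 8,760 hours, that is, 365 days, at extra cost).
kinesis.increase_stream_retention_period(
    StreamName=STREAM_NAME,
    RetentionPeriodHours=168,
)

# Enable server-side encryption with the AWS-managed KMS key for Kinesis.
kinesis.start_stream_encryption(
    StreamName=STREAM_NAME,
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",
)
```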
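The next sketch shows a simple producer and consumer against the same hypothetical stream. The producer specifies a partition key with each put_record call and receives back the sequence number that Kinesis assigns; the consumer reads one shard from the oldest available record. In practice, a consumer would poll continuously (or run as a Lambda function triggered by the stream) rather than read once:

```python
import json

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "orders-stream"  # hypothetical stream name

# Producer: put a few records; the partition key routes each record to a shard.
for order_id in ("order-1", "order-2", "order-3"):
    response = kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps({"order_id": order_id, "amount": 42}).encode("utf-8"),
        PartitionKey=order_id,
    )
    print("wrote to", response["ShardId"], "sequence:", response["SequenceNumber"])

# Consumer: read one shard from the oldest available record (TRIM_HORIZON).
shard_id = kinesis.list_shards(StreamName=STREAM_NAME)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
for record in records:
    print(record["PartitionKey"], record["SequenceNumber"], record["Data"])
```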

Note

Amazon Kinesis shouldn’t be confused with Amazon SQS. With Amazon SQS, each message is delivered to a single consumer and is deleted after it has been processed. If your use case demands multiple producers sending data and multiple consumers reading the same data, then Kinesis is the solution.

For decoupling and asynchronous communications, SQS is the solution, because the sender and receiver do not need to be aware of one another.

SQS does not provide the same kind of persistence: once a message has been read and successfully processed, it is deleted, so it cannot be replayed or read by multiple consumers the way Kinesis records can throughout their retention window. If your use case demands large-scale ingestion, then Kinesis should be used.

In the next section, you will learn about storing the streamed data for further analysis.