Blog

Data Streaming in AWS: Too Many Choices

If you have dealt with a cloud data streaming use case, you have more than likely opened more browser tabs than you’d care to admit while researching the various AWS options available and their, shall we say, lengthy FAQs.

Who has the time to read fifty pages of documentation just to understand which service to pick?

Listed below are several of the more prominent AWS services that all deal with ingesting, transforming, storing, and analyzing streaming data. By the time you are reading this blog post, it may no longer be comprehensive:

Kinesis Data Streams
Kinesis Firehose with optional Lambda integration
Kinesis Data Analytics
Managed Streaming for Apache Kafka (MSK)
Spark Streaming with Elastic MapReduce (EMR)
Glue Streaming ETL
IoT Analytics within the IoT Core

If you find yourself frustrated or spending too much time determining which option to go with, stay with me as I clarify the confusion by covering the fundamentals and ideal use-case for each service.

A brief summary of which to choose under which circumstances is provided at the end of the article.

Kinesis: The Default Choice

Amazon’s premiere choice for data streaming, and the service which you should consider as the default choice for most use-cases, is going to be Kinesis. However, within Kinesis there are a variety of sub-services with esoteric names, the most relevant of which are:

Kinesis Data Streams (often referred to as Streams)
Kinesis Firehose (or just Firehose), and
Kinesis Data Analytics (or just Analytics)

Let’s dive into each of these in detail.

Kinesis Data Streams vs. Kinesis Firehose

Streams vs. Firehose: Feature Overview

Both Data Streams and Firehose ingest data and dump that data into a sink. Why, then, are there two different services that seem to perform the same function?

The main difference between Streams and Firehose is that Streams is intended to send data to compute services with custom consumers, such as applications running on EC2, EMR, or Lambda, that will handle the transformation and processing of data in near real-time with as little as a ~70 ms delay. This makes Streams most useful for supporting real-time dashboards, anomaly detection, and similar time-sensitive applications. Streams integrate nicely with Apache Spark, simplifying real-time data manipulation via streaming Data Frames when more complex analytics are involved.

Meanwhile, Firehose is not intended for near real-time delivery. Rather, it will batch incoming messages, optionally compress and/or transform them with AWS Lambda, and then sink data, usually into an AWS service. This will typically be S3, Redshift, or Elasticsearch.

While Stream messages are usually consumed by custom applications, you can also configure Streams to flow data into a Firehose stream, thus enabling both real-time analytics and batching and storage of data for long-term retention.

If your use case does not demand near real-time processing, Firehose is likely to be more ideal, as well as easier to work with directly.

Firehose: Usually the Better Choice

Why exactly is Firehose the more ideal choice among these two services?

First off: Kinesis Streams requires more extensive coding effort via programs written with its Java-centric Kinesis Producer Library (KPL) and Kinesis Consumer Library (KCL). Firehose, in contrast, is designed mostly to sink data into specific AWS services, meaning no coding effort is involved in the sink component. Publishing messages to Firehose is also easy:

import boto3

firehose_client = boto3.client('firehose')
response

 firehose_client

put_record(
    DeliveryStreamName

'string',
    Record

{'Data': b'bytes'} # base64-encoded
)

If you are OK with Kinesis Streams likely suffering performance benefits on both the producer and consumer side of things, as well as losing out on other benefits, you can use the AWS SDKs to more simply publish messages rather than engage in more extensive KPL-based development:

import boto3

kinesis_client = boto3.client('kinesis')
response

 kinesis_client

put_record(
    StreamName

'string',
    Data

b'bytes', # base64-encoded
    PartitionKey

'string'
)

Secondly: Both Streams and Firehose are fully managed and auto-scaling, but Streams is ‘less’ auto-scaling and not quite serverless.

Firehose auto-scaling allows you to immediately and seamlessly ramp up throughput from development testing to GBs of data per second with no hiccups, assuming your Firehose throughput AWS limits are not being hit.

Streams scaling is more involved. Although you do not directly manage the underlying infrastructure, you must define for a Kinesis Stream a quantity of ‘shards’ that translates into that stream’s supported throughput. One shard can maximally equate to 1 MB/s or 1,000 records / s of write throughput, and 2 MB /s of read throughput. Thus, a certain quantity of ‘shards’ must be pre-provisioned in order to support a given level of throughput. You can manually change a stream’s shard count or you can set up auto-scaling, however, the latter process is more convoluted than it ought to be.

Thus to achieve higher throughput, you scale a stream’s shard count. There is a big caveat, however: One shard takes on average about 30 seconds to add or remove from a stream, and only one shard can be added or removed at a time.

Let’s look at how this could impact a real-world example: If you had a stream with 1,000 shards (~1 GiB/s write throughput) and you anticipate needing to double your throughput in the near future, it will take over 8 hours for your stream to fully scale-up via 1,000 additional shards.

If you have reason to believe you may one day be faced with a sudden and unexpected spike in data volume, Kinesis Data Streams will not be able to scale quickly enough.

With this in mind, Streams is not suitable for use cases with high variability in streaming throughput unless you are willing to accept over-provisioning and over-paying for such occasions, or if you are confident in your ability to anticipate peaky scaling events and plan accordingly. Firehose is simply easier to scale, easier to develop for, and, as we will soon see, easier to price out. Kinesis Streams should be used only when real-time analytics are required.

Streams vs. Firehose: Pricing Complexity Comparison

Firehose’s serverless and auto-scaling approach to data streaming means it also has a straightforward pay-as-you-go pricing scheme that is based on:

GBs of data ingested per month, and for the relevant use cases:
GBs of data format conversions performed
GBs of data delivered to a VPC

Data Streams pricing is more complex to forecast, given that it is based on:

The number of shard-hours provisioned, which can vary with auto-scaling. You will likely have to over-provision to ensure reliable uptime.
The number of 25 KB PUT payloads sent
Optional long-term data retention changes
Optional enabling of enhanced fan-out data retrieval, a feature that improves throughput when many consumers read from the same shard

Streams vs. Firehose: Summary

If your aim is to perform basic data transformations and batch loading of your streaming data into a data store, and you don’t have a real-time processing requirement, you should publish your data straight to Firehose. You should also choose Firehose if you want to minimize time spent on app development or minimize rapid infrastructure scalability concerns.

If you need to process data in real-time, send it through Streams, but know that even with auto-scaling enabled it may not handle peaky throughput increases well enough.

If you need to process data in real-time as well as store it later for analytics, you can send your data to Streams first, then through the Kinesis web console easily configure Streams to forward that data on to Firehose with no coding required (unless optional transformation steps run on Lambda are desired).

It may help to think of Streams as functionally similar to Apache Kafka, but with temporary persistent storage of messages that can be forwarded onto many consumers. Consumers can be custom applications (EC2 or EMR), or AWS managed (Firehose). Firehose is often treated as a batch loader for specific services, usually AWS-centric (S3, Redshift, Elasticsearch) with optional, Lambda-powered serverless data transformations.

Kinesis Data Analytics: Serverless Window Analytics

Streams and Firehose pretty well cover the ingestion, transformation, and shuttling streaming data into real-time analysis applications (Streams) and long-term storage sinks (Firehose). What role does Analytics fill then?

Data Analytics, Amazon’s fully-managed and serverless Apache Flink offering:

Integrates with either Data Streams or Firehose
Runs SQL queries against that streaming data, and
Streams the results on to an AWS service such as another Data Stream, Firehose stream, or Lambda
Data Analytics can also treat static CSV or JSON files located in S3 as SQL tables, enabling JOINs of reference data with streaming data.

Data Analytics is primarily used to continually calculate aggregations of streaming data, annotated with static reference data, over windows of time — usually for real-time alerting purposes — without any code or provisioned infrastructure, just straightforward SQL. You could write Flink code and deploy that instead of SQL, but for ease of maintenance I would stick with SQL given that Scala & Java are less commonly used in the data science world.

The AWS docs showcase an example of sliding window analytics using stock ticker data, but I would instead recommend studying this more interesting real-world example where traffic speeds throughout Belgium are ingested into a Firehose stream, Data Analytics is used to compare current vs. past traffic speeds with the help of reference data in S3, the presence of traffic jams is determined with SQL, and real-time alerts are pushed with Lambda.

Managed Streaming for Apache Kafka (MSK)

Both Kinesis Data Streams and MSK, the AWS-managed offering for Apache Kafka, are effective ‘pub-sub’ systems that enable the publishing and consumption of messages with high throughput, low latency, high availability, and fault tolerance. In general terms of scalability and reliability as a data ingestion and delivery platform, there isn’t too much difference between the two, at least on the surface.

There are critical differences to be found in the details, however, and they generally favor Kinesis:

You can only increase the number of message brokers. This means you cannot scale down an MSK deployment.
While MSK is fully managed, it is not serverless Kafka. Thus, MSK involves some cluster configuration. You must define zones and subnets into which your brokers will launch, the number of brokers to use per zone, the instance type powering your brokers, and so on.
Given that MSK is not serverless, this means you also pay for the EBS-based storage backing instances. This too cannot be scaled down.
You cannot change the instance type after the initial cluster setup. Your only option for scaling up is to increase instance counts.
Kinesis is AWS’ in-house fully-managed and serverless streaming service, so it is naturally going to have better integration with AWS services. Some Kinesis consumers can be connected with no code, while MSK requires all consumer applications to be custom-built, e.g. on EC2, EKS, EMR, or with Flink code deployed on Kinesis Data Analytics.
You can only have one Kafka consumer in a given consumer ‘group’ read from a partition at a time. Kinesis on the other hand supports multiple consumers per shard.

Initial and ongoing cluster configuration, combined with the inability to scale down deployments, means that there will be more short- and long-term DevOps overhead incurred by using MSK compared to Kinesis.

Pricing is also different between the two, and again generally favors Kinesis.

As mentioned earlier, Kinesis Data Streams pricing is largely on-demand. It is mostly based on the number of 25 KB PUT payloads sent and the number of shard-hours provisioned to enable your desired PUT and GET throughput. Shard counts can be configured to be auto-scaling so that this also simulates on-demand pricing, although the implementation for this is a little clunky.

MSK on the other hand is priced based on how many instances of a given instance size you have chosen are running, as well as the EBS volume sizes backing them. Neither instance count nor EBS volume size can be downsized; if you over-provision, you are stuck with the bill unless you terminate the cluster. Storage can be set to auto-scale, but instance sizes and counts cannot. You will instead have to continuously keep track of instance CloudWatch metrics, how many partitions are being used per broker, and other performance indicators, then scale the service manually in response. To top it all off, CPU usage in the cluster is ‘strongly recommended’ to remain under 60% so you will inevitably over-pay for compute capacity.

There is one major perk to using MSK: Kinesis offers at-least-once message delivery, while Kafka guarantees exactly-once delivery. Generally, though, de-duplication of messages is much easier to deal with than challenges around cost-effective infrastructure scalability.

Even if you are able to successfully operate an MSK cluster at a very high-through scale that is cost-optimized and managed to price the deployment out to be cheaper than Kinesis, I would bet it is likely you would still save more using Kinesis due to the reduced DevOps (high salary) man-hours spent on keeping a business-critical pub-sub service running when a fully-managed equivalent would have done just as well with virtually no effort.

I personally only recommend AWS MSK for companies that have an existing Kafka-based application that, due to either time, refactor cost, or employee resource constraints, must be lift-and-shifted with no architectural changes.

If you are interested in learning more, I recommend reading this “honest AWS MSK review”. Comments on the article are also insightful and focus on MSK vs. Kinesis pricing examples.

Spark Streaming with EMR

AWS EMR is Amazon’s fully-managed and auto-scaling (but not serverless) offering that enables cluster-based execution of scripts written for open-source big data processing tools. Included among these tools: Apache Spark.

Spark enables DataFrame-based analytics that can be run on static as well as streaming datasets. You can perform analytics on DataFrames with typical programmatic function calls, or you can run ANSI SQL-compliant Spark SQL. Working with Spark SQL on streaming data is similar to using SQL with Apache Flink / Kinesis Data Analytics. Spark can take as a monitored streaming ingestion source Apache Kafka and Apache Flume, as well as AWS S3 and AWS Kinesis Data Streams.

Given Spark’s excellent built-in integrations with Streams and S3, as well as its low learning curve with PySpark, I recommend using Spark Streaming with DataFrames on EMR when you need to:

Engage in real-time analytics on Kinesis Data Streams or S3, and
When the scale or complexity of the analytics exceeds some of the many limitations of Kinesis Data Analytics.

Most critically, you should be aware that with Data Analytics:

No row of data can be >512 KBs. Spark’s limit is much higher: 2 GBs.
Your reference dataset exceeds 1 GB in size. Spark has no limit.
Each application must have exactly one streaming source and up to one reference data source. With Spark, you can join together multiple streaming sources and multiple static reference data sources.
Windowed queries should not exceed 60 minutes as data is stored in volatile storage from which the stream may be re-built if there are unexpected application interruptions. Spark has no time window limit.

Glue Streaming ETL

Glue Streaming is a fully-managed, auto-scaling, and serverless Spark Streaming DataFrames offering, so you would use this if you are experienced with Spark and want to engage in custom transformation and analytics on data streaming from Kinesis with this service rather than with a self-managed EMR cluster or Lambda functions. Glue Streaming can sink to the usual data sinks such as S3, Redshift, and DynamoDB.

Glue can to some extent auto-generate Spark code for you based on a list of transformations you ask for in the web console, so a lot of Spark experience isn’t necessarily required, although it is helpful to have a handle on the basics.

I have personally found this service to be a bit finicky. While serverless Spark sounds convenient, it comes with some limitations:

When using schema detection, you cannot perform joins of streaming data.
You cannot change the number of Kinesis Streams shards while a Glue Streaming ETL job is running. You have to stop the job, change your Data Streams shard count and wait for that operation to complete, and then restart the job.

For upstream scalability alone I would personally choose to run a self-managed Spark cluster with auto-scaling enabled over Glue Streaming ETL, but in your testing, you may find that the auto-scaling and serverless nature of Glue Streaming ETL outweigh its cons.

IoT Analytics within the IoT Core

What the IoT Analytics components do is not as transparent as it should be — so I wrote an entire article on it! Production-Scale IoT Best Practices: Implementation with AWS (part 2). Let’s cover the basics here and leave many of the details for that article.

IoT devices streaming into AWS will hit the IoT Core. From here, those messages can be sent to other services for custom analytics. For example, with IoT Rules you can easily forward IoT Core data into:

DynamoDB
Firehose, which then dumps data to DynamoDB, S3, Redshift, or Elasticsearch
Data Streams, which can send data to EC2, EMR, Lambda, or Firehose

As you can see, there’s a wide range of possible IoT dataflows you can set up.

However, if you want to process all your IoT data, from streaming into your platform through the storage and analytics, completely within a unified IoT-centric platform that is entirely serverless, auto-scaling, and fully managed, and which has multiple analytics-related integrations with other AWS services such as Sagemaker and Quicksight, that’s when you would stick to using IoT Analytics. The IoT core allows you to process IoT data within a single service rather than piecing together multiple services.

Clicking through the IoT Analytics wizard will set up the following:

An IoT Channel is where the IoT data arrives
An IoT Pipeline that takes Channel data and allows you to optionally enrich, transform, and filter messages based on their attributes
An IoT Data Store, where streaming data is stored, either indefinitely or for a specified period of time. Behind the scenes, this data is being stored in an AWS-managed S3 bucket.
An IoT Data Set can be created from a Data Store. This is a subset of an IoT Data Store created with IoT SQL which possesses its own data retention period as well as the ability to re-create itself on-demand or on a recurring schedule. Like Data Stores, Data Sets are stored as CSV files in a managed bucket. Data Sets ultimately mean that you can create a static data set based on a custom filter (for example, select all temperature data from a narrow yet interesting time frame), generate that data set on-demand once, and retain that filtered data set indefinitely for downstream analytics while allowing the original, raw data store messages to expire according to a retention period that is deemed by your organization to most effectively balance raw data retention against cost-effectiveness.
Some AWS services that integrate with IoT Analytics such as Quicksight will only draw from data sets, while other services such as SageMaker can draw from both data stores and data sets. Generally, all services should be able to draw from data sets. Due to the more limited connectivity with data stores and the cost implications of storing raw data indefinitely in IoT Analytics, in a production use-case, you should consider becoming accustomed to the methodology of creating discrete, filtered datasets to be used in analytics or ML model generation, with the raw data store expiring over time unless your organization deems it acceptable to pay for the full history storage of IoT data.

Which Service Do I Use?

In my effort to concisely explain the various AWS streaming options, I fear I have written too much! Here is a quick summary of when to pick a service:

Kinesis Data Streams: When you need to perform real-time analytics on EC2, EMR, or Lambda and don’t mind some added development complexity and its inability to rapidly scale throughput.
Kinesis Firehose: When you need to batch streaming data, optionally transform and/or compress it, and place it into long-term storage on S3, Redshift, or Elasticsearch. When you want an easy-to-use, serverless, and immediately auto-scaling streaming data ingestion platform and don’t mind that it will batch write data rather than send messages to consumers in real-time.
Kinesis Data Analytics: When you want to perform basic windowed analytics on Data Streams or Firehose data, typically for real-time alerting, with SQL on a simple, serverless, auto-scaling platform.
Managed Streaming for Apache Kafka (MSK): When you have an existing Kafka-based application and seek to lift-and-shift into AWS. Time or resource constraints prevent you from re-designing an application to use Kinesis.
Spark Streaming with EMR: When you need to perform advanced windowed analytics on Kinesis Data Streams via JOIN operations involving multiple streaming sources and/or static reference datasets.
Glue Streaming ETL: Similar to Spark Streaming with EMR, except that you can run Spark workloads within a serverless, auto-scaling environment. Does not allow for upstream Data Streams shard scaling without stopping and restarting connected Glue streaming jobs.
IoT Analytics within the IoT Core: An all-in-one ingestion, storage, and analytics platform for streaming IoT data that is fully managed, serverless, and auto-scaling. The IoT Core enables you to avoid piecing together separate AWS services to accomplish the same functionality.

Thanks for reading! To stay connected, follow us on the DoiT Engineering Blog, DoiT Linkedin Channel, and DoiT Twitter Channel. To explore career opportunities, visit https://careers.doit.com.

Subscribe to updates, news and more.

Blog

Data Streaming in AWS: Too Many Choices

Kinesis: The Default Choice

Kinesis Data Streams vs. Kinesis Firehose

Streams vs. Firehose: Feature Overview

Firehose: Usually the Better Choice

Streams vs. Firehose: Pricing Complexity Comparison

Streams vs. Firehose: Summary

Kinesis Data Analytics: Serverless Window Analytics

Managed Streaming for Apache Kafka (MSK)

Spark Streaming with EMR

Glue Streaming ETL

IoT Analytics within the IoT Core

Which Service Do I Use?

Subscribe to updates, news and more.

Related blogs

Company

Offering

Support

Never miss an update.

Subscribe to updates, news and more.

Connect With Us