Amazon Kinesis Data Streams
Collect and store streaming data in real time
1. Retention up to 365 days (data can’t be deleted until it expires)
2. Data ordering is guaranteed for records that share a partition key (they go to the same shard)
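The ordering guarantee comes from how records are routed: Kinesis hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch below only illustrates the idea. The function name `shard_for_key` is hypothetical, and it assumes the shards divide the hash space into equal ranges, which may not hold after shards are split or merged.

```python
import hashlib


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key."""
    # MD5 of the partition key, read as a 128-bit unsigned integer.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Assumption: every shard owns an equal slice of the 128-bit hash space.
    range_size = 2**128 // shard_count
    # The last shard absorbs the remainder when the space does not divide evenly.
    return min(hash_value // range_size, shard_count - 1)


if __name__ == "__main__":
    # The same key always maps to the same shard, so its records stay in order.
    print(shard_for_key("device-42", 4), shard_for_key("device-42", 4))
```

Records with different partition keys may land on different shards, so no ordering is implied between them.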
Kinesis Data Streams - Capacity Modes
Provisioned mode
On-demand mode
Amazon Data Firehose - AWS Destinations
Amazon Data Firehose
Firehose Buffer Sizing
Firehose accumulates records in a buffer and flushes it when either a time threshold or a size threshold is reached
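A minimal sketch of the time and size rules, not Firehose's actual implementation. The class `FlushBuffer` and its fields are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow. One simplification: this sketch only checks the time rule when a new record arrives, whereas a real service would also flush on a timer.

```python
from dataclasses import dataclass, field


@dataclass
class FlushBuffer:
    max_bytes: int          # size rule: flush once this many bytes are buffered
    max_seconds: float      # time rule: flush once the buffer is this old
    opened_at: float = 0.0  # arrival time of the first record in the buffer
    size: int = 0
    records: list = field(default_factory=list)

    def add(self, record: bytes, now: float):
        """Buffer a record; return the flushed batch if a rule fires, else None."""
        if not self.records:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            batch, self.records, self.size = self.records, [], 0
            return batch
        return None
```

With a high-throughput stream the size rule tends to fire first; with a slow stream the time rule does.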
Amazon Managed Service for Apache Flink
Framework for processing data streams. Can’t read from Amazon Data Firehose.
Amazon Managed Streaming For Apache Kafka
Amazon MSK creates and manages Kafka broker nodes.
* Deployed in your VPC, multi-AZ
* Data is stored on EBS volumes
Kinesis Data Streams vs Amazon MSK
Kinesis Data Streams:
* 1 MB message size limit
* 12 months maximum retention
* Shard splitting and merging
Amazon MSK:
* Can be configured for larger messages
* No retention limit
* Can only add partitions to a topic
AWS Batch
Run batch jobs as Docker images
AWS Batch - Multi node mode
Leverages multiple EC2 / ECS instances. One main node and multiple child nodes. Doesn’t work with Spot Instances.
Amazon Elastic MapReduce
EMR creates Hadoop clusters in a single AZ to analyze and process vast amounts of data.
* Can access data in DynamoDB and S3
EMR File System
EMRFS stores persistent data in Amazon S3 while providing data encryption
EMR node types
EMR instance configuration
AWS Glue
Managed extract, transform and load service
AWS Glue Data Catalog
A crawler writes metadata about databases and supported data stores to the Data Catalog
Redshift
Online analytical processing
* Columnar storage of data: easier to aggregate
* Massively Parallel Query Execution
* Not all clusters support multi AZ
Redshift nodes
Leader node: query planning and results aggregation
Compute node: performs queries and sends results to the leader node
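The split of work between the two node types can be pictured as a scatter-gather. The sketch below is only an illustration in plain Python, not how Redshift is implemented; the function names are hypothetical. Each compute node aggregates its own rows, and the leader combines the partial results.

```python
def compute_node_partial(rows):
    """A compute node aggregates only the rows it stores."""
    return sum(rows), len(rows)


def leader_average(rows_per_node):
    """The leader combines the partial (sum, count) pairs into one result."""
    partials = [compute_node_partial(rows) for rows in rows_per_node]
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


if __name__ == "__main__":
    # Three compute nodes, each holding part of one column.
    print(leader_average([[1, 2, 3], [4, 5], [6]]))
```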
Redshift Enhanced VPC Routing
COPY and UNLOAD traffic goes through the VPC
Redshift snapshots
Point-in-time backups stored in S3
* Incremental
* Restore into a new cluster
* Automated or manual
* Copy automatically to another region
Redshift Spectrum
Query data that is in S3 without loading it
* Must have a Redshift cluster to start the query
Redshift resource link
Data Catalog object that links to a database and lets integrated services run queries on it. Those services cannot access the database directly across accounts.
Redshift Workload Management
Flexibly manage query priorities within workloads
* Multiple query queues
* Route queries to the appropriate queue
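A rough sketch of queue routing, assuming queries are matched to queues by the user's group or by a query group label, with a default queue as the fallback. The queue names, group names, and the `route_query` helper are all made up for illustration.

```python
# Hypothetical queue definitions, checked in order.
QUEUES = [
    {"name": "etl", "user_groups": {"etl_users"}, "query_groups": {"load"}},
    {"name": "dashboards", "user_groups": {"analysts"}, "query_groups": {"report"}},
]


def route_query(user_group, query_group=None):
    """Return the first queue matching the user group or query group."""
    for queue in QUEUES:
        if user_group in queue["user_groups"] or query_group in queue["query_groups"]:
            return queue["name"]
    # Nothing matched, so the query falls through to the default queue.
    return "default"


if __name__ == "__main__":
    print(route_query("analysts"))
    print(route_query("guest", "load"))
    print(route_query("guest"))
```

Keeping short dashboard queries and long ETL queries in separate queues stops one from starving the other.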
Redshift EVEN distribution
The leader node distributes rows across the slices in a round-robin fashion, regardless of their values. EVEN distribution is appropriate when a table doesn’t participate in joins
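Round-robin placement can be sketched in a few lines. This is a plain Python illustration of the idea, with a hypothetical `distribute_even` helper; it ignores row values entirely, which is why each slice ends up with a near-equal share.

```python
def distribute_even(rows, slice_count):
    """Assign rows to slices in round-robin order, ignoring their values."""
    slices = [[] for _ in range(slice_count)]
    for index, row in enumerate(rows):
        slices[index % slice_count].append(row)
    return slices


if __name__ == "__main__":
    for number, contents in enumerate(distribute_even(["a", "b", "c", "d", "e"], 2)):
        print(number, contents)
```

Because matching rows from two tables are not placed on the same slice, a join would need to move data between slices, which is why this style suits tables that are not joined.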