Analytics Flashcards

1
Q

Kinesis

A
  • Real-time data processing service that continuously captures (and stores) large amounts of data, which can power real-time streaming dashboards.
  • Using the AWS-provided SDKs, you can create real-time dashboards, integrate dynamic pricing strategies, and export data from Kinesis to other AWS services.
2
Q

Kinesis: services it can export data to

A

EMR;
S3;
Redshift;
Lambda

3
Q

Kinesis Components

A

Stream
Producers (data creators)
Consumers (data consumers)
Shards (processing power)

4
Q

Kinesis Benefits

A
  • Real-time processing: continuously collect data and build applications that analyze it as it is generated.
  • Parallel processing: multiple Kinesis applications can process the same incoming data stream concurrently.
  • Durable: Kinesis synchronously replicates the streaming data across three Availability Zones (data centers) within a single AWS region and preserves the data for 24 hours by default.
  • Scalable: can stream from as little as a few megabytes to several terabytes per hour.
5
Q

Kinesis: when to use

A
Gaming;
Real-time analytics;
Application alerts;
Log/event data collection;
Mobile data capture
6
Q

Kinesis Producer

A
  • Devices that collect data for Kinesis processing.
    • Continuously put data into a Kinesis stream.
    • Include, but are not limited to: IoT sensors; mobile devices.
    • The more data you want to process, the more shards you add to your Kinesis stream.
    • Each shard can handle 2 MB of read data per second and 1 MB of write data per second.
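
The per-shard limits above lend themselves to a quick sizing calculation. The following is a minimal sketch, not an official AWS formula: the function name `shards_needed` and the sample throughput figures are made up for illustration, and it considers only the MB-per-second limits quoted on this card.

```python
import math

# Per-shard limits as stated on the card.
WRITE_MB_PER_SHARD = 1  # MB/s a shard can ingest
READ_MB_PER_SHARD = 2   # MB/s a shard can serve to consumers


def shards_needed(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Return the smallest shard count that covers both write and read throughput."""
    by_write = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_read = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write, by_read)


if __name__ == "__main__":
    # Hypothetical workload: producers write 5 MB/s, consumers read 8 MB/s.
    print(shards_needed(5, 8))
```

In this hypothetical workload the write side is the bottleneck: 5 MB/s of writes needs five shards, while 8 MB/s of reads needs only four.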
7
Q

Kinesis Consumer

A
  • Consume data from the stream, concurrently.
  • Multiple consumers can consume the same data at the same time.
    • Include: real-time dashboards; S3; Redshift; EMR.
    • Any application you have created can consume the stream's data.
    • The stream keeps 24 hours of data by default; retention can be extended up to 7 days (and up to 365 days with long-term retention).
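
To make "multiple consumers can consume the same data" concrete, here is a toy in-memory model. It is only a sketch and does not use the Kinesis API: `ToyStream` and `ToyConsumer` are hypothetical classes, and the assumption illustrated is that each consumer tracks its own read position, so reading never removes data from the stream.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """Append-only list of records standing in for a stream."""
    records: list = field(default_factory=list)

    def put(self, record):
        self.records.append(record)

    def read_from(self, position):
        # Return everything after `position` plus the new position.
        return self.records[position:], len(self.records)


@dataclass
class ToyConsumer:
    """Consumer that remembers how far it has read."""
    name: str
    position: int = 0
    seen: list = field(default_factory=list)

    def poll(self, stream):
        new_records, self.position = stream.read_from(self.position)
        self.seen.extend(new_records)


if __name__ == "__main__":
    stream = ToyStream()
    dashboard = ToyConsumer("dashboard")
    archiver = ToyConsumer("archiver")

    for reading in ("t=1", "t=2", "t=3"):
        stream.put(reading)

    dashboard.poll(stream)
    archiver.poll(stream)
    # Both consumers receive every record independently.
    print(dashboard.seen == archiver.seen == stream.records)
```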
8
Q

EMR (Elastic MapReduce)

A

A service that deploys EC2 instances running the Hadoop big data framework.

    • Used to analyze and process vast amounts of data.
    • Supports other distributed frameworks such as Apache Spark, HBase, Presto, and Flink.
9
Q

EMR Workflow

A
  • Data stored in S3, DynamoDB, or Redshift is sent to EMR.
  • The data is mapped to a "cluster" of Hadoop master/slave nodes for processing.
  • Computations (coded/created by the developer) are used to process the data.
  • The processed data is then reduced to a single output set of return information.
10
Q

Other EMR Facts

A
  • The admin has the ability to access the underlying operating system.
    • You can add user data to EC2 instances launched into the cluster via bootstrapping.
    • EMR takes advantage of parallel processing for cluster processing of data.
    • You can resize a running cluster at any time, and you can deploy multiple clusters.
11
Q

EMR Master Node

A
  • A node that manages the cluster by running software components that coordinate the distribution of data and tasks among the other (slave) nodes for processing.
    • Tracks the status of tasks and monitors the health of the cluster.
12
Q

EMR Slave Node

A

There are two types: core nodes and task nodes.

13
Q

Core Node

A

A slave node that has software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster.
– Core nodes do the heavy lifting with the data.

14
Q

Task Node

A

A slave node that has software components which only run tasks; it does not store data in HDFS.
– Task nodes are optional.

15
Q

EMR Map Phase

A
  • Mapping is a function that defines the process which splits the large data file for processing.
  • During the mapping phase, the data is split into 128 MB "chunks";
    the larger the instance size used in your EMR cluster, the more chunks you can map and process at the same time.
  • If there are more chunks than nodes/mappers, the chunks will queue for processing.
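
The chunking and queueing described above come down to simple arithmetic. The sketch below assumes each mapper handles one chunk at a time; `map_plan` and the sample file size are hypothetical and only illustrate the card's 128 MB figure.

```python
import math

CHUNK_MB = 128  # chunk size stated on the card


def map_plan(file_mb: float, mappers: int) -> tuple[int, int, int]:
    """Return (chunks, waves, queued) for a file split across `mappers`.

    chunks: number of 128 MB chunks (the last one may be partial)
    waves:  rounds of processing if each mapper takes one chunk per round
    queued: chunks that must wait because every mapper is busy in round one
    """
    chunks = math.ceil(file_mb / CHUNK_MB)
    waves = math.ceil(chunks / mappers)
    queued = max(0, chunks - mappers)
    return chunks, waves, queued


if __name__ == "__main__":
    # Hypothetical 1 GB (1024 MB) file processed by 4 mappers.
    print(map_plan(1024, 4))
```

With these numbers the file becomes 8 chunks; 4 are processed immediately and 4 queue for a second round.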
16
Q

EMR reduce phase

A

Reducing is a function that aggregates the split data back into one data source.
The reduced data needs to be stored elsewhere, because data processed by the EMR cluster is not persistent.
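
The map and reduce phases can be illustrated with a toy word count in plain Python. This is a single-process sketch of the idea, not Hadoop or EMR code: the helper names are hypothetical, and chunks are groups of lines rather than 128 MB blocks.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Split the input into fixed-size chunks, as the map phase does."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: count words within a single chunk."""
    return Counter(word for line in chunk for word in line.lower().split())


def reduce_counts(partials):
    """Reduce step: aggregate the per-chunk counts into one result."""
    return reduce(lambda total, part: total + part, partials, Counter())


def word_count(lines, chunk_size=2):
    partials = [map_chunk(chunk) for chunk in split_into_chunks(lines, chunk_size)]
    return reduce_counts(partials)


if __name__ == "__main__":
    result = word_count(["a b", "b c", "c c"])
    # On a real cluster the reduced output would be written to durable
    # storage (for example S3), since cluster storage is not persistent.
    print(dict(result))
```

Each chunk is counted independently, which is what allows the map step to run in parallel across nodes; only the reduce step needs to see all partial results.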