What is the purpose of Glue crawlers?
To populate the Glue Data Catalog with metadata on the data in S3
What does crawling with Glue Crawlers enable for Athena, Redshift and EMR?
Lets you query your unstructured data in S3 from Athena, Redshift (via Redshift Spectrum) and EMR as if it were structured, because the crawler has recorded a schema for it in the Data Catalog
What is ResolveChoice in Glue dynamic frames?
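Allows you to deal with ambiguity in a dynamic frame, where a single field contains values of more than one type, e.g. a column holding both ints and strings. You can cast to one type (cast), split into one column per type (make_cols), wrap the values in a struct (make_struct) or keep only one type (project)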
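ResolveChoice is easiest to see on a field that arrives with mixed types. The following is a minimal pure-Python sketch of two strategies, casting and splitting into columns. It does not use the Glue library: the helpers `resolve_cast` and `resolve_make_cols` and the sample records are hypothetical, and the real DynamicFrame call is only shown in comments.

```python
# Toy model of two ResolveChoice strategies (not the Glue API itself).
# In a Glue job the equivalent calls on a DynamicFrame `dyf` would be, e.g.:
#   dyf.resolveChoice(specs=[("price", "cast:double")])
#   dyf.resolveChoice(specs=[("price", "make_cols")])

records = [
    {"id": 1, "price": 10},        # price arrived as an int
    {"id": 2, "price": "12.50"},   # price arrived as a string
]

def resolve_cast(rows, field, to_type):
    """Cast every value of `field` to a single type (like 'cast:<type>')."""
    return [{**row, field: to_type(row[field])} for row in rows]

def resolve_make_cols(rows, field):
    """Split `field` into one column per observed type (like 'make_cols').

    Columns here are named after Python types (price_int, price_str);
    Glue names them after its own types (e.g. price_int, price_string).
    """
    type_names = sorted({type(row[field]).__name__ for row in rows})
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != field}
        for name in type_names:
            # Only the column matching this value's type is populated.
            matches = type(row[field]).__name__ == name
            new_row[f"{field}_{name}"] = row[field] if matches else None
        out.append(new_row)
    return out

cast_rows = resolve_cast(records, "price", float)
split_rows = resolve_make_cols(records, "price")
```

Casting gives one clean numeric column; splitting keeps both representations side by side so nothing is lost while you decide what to do with them.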
What are dynamic frames in Glue?
Glue's counterpart to Spark's dataframes: a collection of records that each carry their own schema (self-describing), so no schema has to be declared up front. Used for semi-structured data
What does Hive allow you to do on EMR?
Run SQL-like (HiveQL) queries on your data from EMR
Can you modify the data catalog using a script to update the partitions or schema recorded there?
Yes, you can do this for certain data formats (JSON, CSV, Avro, Parquet) and if the data is in S3
What is a job bookmark in Glue ETL jobs?
A way to persist the state of the previous job run, allowing you to prevent the re-processing of old data
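The bookmark idea can be sketched as a toy model in plain Python: remember the newest timestamp already processed and skip anything at or before it. This only illustrates the concept and is not how Glue stores bookmark state internally; `run_job` and the file names are hypothetical.

```python
# Toy model of a job bookmark (illustrative only).
# In a real Glue job you enable bookmarks on the job instead, e.g. with the
# job parameter: --job-bookmark-option job-bookmark-enable

def run_job(files, bookmark):
    """Return (files processed this run, updated bookmark).

    `files` maps a file name to its last-modified timestamp.
    `bookmark` is the newest timestamp seen by previous runs, or None.
    """
    new_files = sorted(
        name for name, modified in files.items()
        if bookmark is None or modified > bookmark
    )
    if new_files:
        bookmark = max(files[name] for name in new_files)
    return new_files, bookmark

files = {"a.csv": 1, "b.csv": 2}
first_run, bookmark = run_job(files, None)       # processes everything

files["c.csv"] = 3                               # new data lands between runs
second_run, bookmark = run_job(files, bookmark)  # processes only the new file
```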
What is Glue Studio?
A visual interface for defining ETL workflows using DAGs
What is Glue Data Quality?
A feature within Glue Studio that allows you to perform an action based on an evaluation of the quality of your data, e.g. fail the whole job or report the results to CloudWatch
What is Glue DataBrew?
A visual data preparation tool for transforming your data with over 250 ready-made transformations
What should you create if you have a sequence of transformations you know you will want to re-use elsewhere in Glue DataBrew?
A ‘recipe’
What are 3 ways to deal with PII in Glue DataBrew?
What is the purpose of Glue Workflows?
To design multi-job or multi-crawler workflows within AWS Glue
Does Lake Formation itself cost money?
No, but the underlying services it uses (e.g. Glue, S3, Athena) are charged as normal
What is the finest grain of access in Lake Formation?
Cell-level, using LF Data Filters
What is needed to do cross-account permissions in Lake Formation?
Set up the recipient as a data lake administrator
Use AWS RAM for accounts external to your organisation
Can Athena query unstructured data?
Yes - it can do structured, semi-structured and unstructured data
Does Athena support all of CSV, TSV, Avro, JSON?
Yes, it also supports Parquet and ORC (which are the obvious ones as they are columnar)
What are Athena workgroups?
A way to organise users, teams, apps and workloads into groups, so that you can control query access and track costs per group, limit the amount of data each group can scan, and keep a separate query history for each group
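As a rough sketch of a per-group scan limit, the helper below builds the request for Athena's CreateWorkGroup API. The workgroup name, bucket and limit are made-up values, `workgroup_request` is a hypothetical helper, and the boto3 call itself is left in a comment because it needs AWS credentials.

```python
def workgroup_request(name, output_location, max_bytes_per_query):
    """Build keyword arguments for Athena's CreateWorkGroup API."""
    return {
        "Name": name,
        "Configuration": {
            # Where query results for this group are written.
            "ResultConfiguration": {"OutputLocation": output_location},
            # Stop users overriding the group's settings on the client side.
            "EnforceWorkGroupConfiguration": True,
            # Publish per-workgroup query metrics to CloudWatch.
            "PublishCloudWatchMetricsEnabled": True,
            # Cancel any query that scans more than this many bytes.
            "BytesScannedCutoffPerQuery": max_bytes_per_query,
        },
    }

# Hypothetical values: a 100 GiB per-query limit for one team.
request = workgroup_request(
    "analytics-team",
    "s3://example-bucket/athena-results/",
    100 * 1024**3,
)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("athena").create_work_group(**request)
```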
How do you pay for Athena?
Per TB scanned, for successful and cancelled queries but not failed queries
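A quick way to sanity-check this billing rule is a small cost estimator. The price of 5 USD per TB, the 10 MB per-query minimum and the binary definition of a TB are assumptions here, not taken from these notes, so check the current Athena pricing page for your region. `query_cost_usd` is a hypothetical helper.

```python
# Assumed values (verify against current Athena pricing for your region).
PRICE_PER_TB_USD = 5.0
MIN_BYTES_PER_QUERY = 10 * 1024**2   # assumed 10 MB minimum per billed query
BYTES_PER_TB = 1024**4               # assumed binary terabyte

def query_cost_usd(bytes_scanned, status):
    """Estimate the charge for one query given its final status."""
    if status == "FAILED":
        return 0.0  # failed queries are not billed
    # Successful and cancelled queries are billed for the data they scanned.
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * PRICE_PER_TB_USD

full_scan = query_cost_usd(2 * BYTES_PER_TB, "SUCCEEDED")
cancelled = query_cost_usd(BYTES_PER_TB // 2, "CANCELLED")
failed = query_cost_usd(2 * BYTES_PER_TB, "FAILED")
```

The practical takeaway is that cancelling a runaway query stops the meter but does not refund what was already scanned.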
What are 3 tips for optimising performance with Athena?
1/ Use columnar formats such as Parquet and ORC
2/ Use a small number of large files instead of a large number of small ones
3/ Use partitions
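For the partitioning tip, one common convention is Hive-style `key=value` prefixes in S3, so that a query filtering on those keys only scans the matching prefixes. A minimal sketch, with a hypothetical helper `partitioned_key` and made-up prefix and file names:

```python
from datetime import date

def partitioned_key(prefix, day, filename):
    """Build a Hive-style partitioned S3 key for a given date."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )

key = partitioned_key("logs", date(2024, 1, 15), "part-0000.parquet")
```

With data laid out like this and the partitions registered in the catalog, a filter such as `WHERE year = '2024' AND month = '01'` lets Athena skip every other prefix.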
What type of tables in Athena are ACID compatible?
Iceberg tables
What negative effect can ACID support have on your tables?
Can bloat them over time, because extra data files and snapshots are retained to keep reads consistent for all users. You should periodically compact your data
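Periodic compaction of an Iceberg table in Athena is done with SQL maintenance statements. The sketch below only builds the statement text; the database and table names are made up, `compaction_statements` is a hypothetical helper, and you would submit the statements through the Athena console or API yourself.

```python
def compaction_statements(database, table):
    """Return Athena SQL for compacting an Iceberg table and cleaning it up."""
    target = f"{database}.{table}"
    return [
        # Rewrites many small data files into fewer, larger ones.
        f"OPTIMIZE {target} REWRITE DATA USING BIN_PACK",
        # Expires old snapshots and removes files that are no longer referenced.
        f"VACUUM {target}",
    ]

statements = compaction_statements("sales_db", "orders")
```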
What regulatory use case is ACID useful for?
GDPR compliance, e.g. reliably deleting an individual's records on request