What is the main goal of distributed data processing frameworks?
To process large datasets by splitting work across multiple machines, so that jobs finish faster and can handle data that does not fit on a single node.
What is data parallelism in distributed processing?
Running the same operation on different partitions of the data in parallel across multiple workers.
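A minimal single-machine sketch of data parallelism, with threads standing in for cluster workers (all names here are illustrative, not from any real framework): the same function runs on each partition of the data independently, and the per-partition results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    # The identical operation, applied to every partition.
    return sum(partition)

def parallel_sum(data, num_partitions=4):
    # Split the input into num_partitions strided partitions.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Run partial_sum on every partition in parallel, then combine.
        return sum(pool.map(partial_sum, partitions))
```

In a real engine the partitions live on different machines and the combine step runs on a driver or reducer, but the shape of the computation is the same.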
What is task parallelism in distributed processing?
Running different tasks or stages in parallel, possibly on the same or different data, to increase throughput.
What is the basic idea of MapReduce?
Apply a map function to transform records into key-value pairs, then a reduce function to aggregate values for each key across the cluster.
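A toy, single-process sketch of the MapReduce model for the classic word-count job (the function names are illustrative): map emits key-value pairs, a shuffle step groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_fn(line):
    # Mapper: emit (word, 1) for each word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reducer: aggregate all values seen for one key.
    return key, sum(values)

def mapreduce(lines):
    # "Shuffle": group mapped pairs by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each group independently (done in parallel on a real cluster).
    return dict(reduce_fn(k, vs) for k, vs in groups.items())
```

On a cluster, the map calls run on many machines at once, the shuffle moves pairs over the network, and each reducer handles a disjoint set of keys.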
Why was MapReduce historically important for big data?
It provided a simple programming model and fault-tolerant execution over large clusters for batch jobs on file systems like HDFS.
What are the main limitations of classic MapReduce?
Disk-heavy execution with repeated writes between stages, limited expressiveness for complex DAGs, and relatively high latency.
What is a DAG (Directed Acyclic Graph) in modern data engines?
A graph of computation stages where nodes are operations and edges represent data dependencies, with no cycles.
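A minimal sketch of a DAG of stages using Python's standard library: nodes are operations, edges are data dependencies, and any topological order is a valid execution schedule. The stage names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
stages = {
    "scan":      set(),
    "filter":    {"scan"},
    "aggregate": {"filter"},
    "join":      {"filter", "scan"},
    "output":    {"aggregate", "join"},
}

# A topological order: every stage appears after all of its dependencies.
order = list(TopologicalSorter(stages).static_order())
```

A real scheduler additionally runs independent stages (here, "aggregate" and "join") concurrently once their inputs are ready.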
Why do engines like Spark use DAGs instead of fixed map/reduce stages?
DAGs allow more complex multi-stage workflows and better global optimization across operations.
What is an execution engine in a distributed system?
The component that schedules tasks, coordinates workers, manages data locality, and handles retries and failures for jobs.
What is data locality and why does it matter?
Processing data close to where it is stored to minimize network I/O and improve performance.
What is Apache Spark conceptually?
A distributed data processing engine that supports batch, streaming, SQL, and ML workloads using DAG-based execution and in-memory computation.
What is an RDD in Spark?
A Resilient Distributed Dataset representing an immutable, partitioned collection of records that can be processed in parallel.
Why did Spark introduce higher-level APIs like DataFrames?
DataFrames provide schema-aware, declarative operations that allow the optimizer to generate more efficient execution plans than RDD code.
What is a DataFrame in Spark-like systems?
A distributed, tabular data structure with named columns and a schema, similar to a table in a relational database.
Why are DataFrames generally preferred over raw RDDs for most workloads?
They enable query planning, optimization, and code generation, often yielding better performance with less boilerplate code.
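A toy sketch of why a declarative plan helps, assuming a made-up plan representation (nested tuples, not any real engine's API): because the engine sees the whole logical plan, it can rewrite it, for example pushing a filter below a projection so later operators see fewer rows.

```python
def push_down_filter(plan):
    # Hypothetical plan nodes: ("scan", table),
    # ("project", columns, child), ("filter", (column, value), child).
    if plan[0] == "filter":
        pred, child = plan[1], plan[2]
        if child[0] == "project" and pred[0] in child[1]:
            # Rewrite Filter(Project(x)) -> Project(Filter(x)):
            # filter the base data first, then project.
            return ("project", child[1], ("filter", pred, child[2]))
    return plan
```

With opaque RDD lambdas the engine cannot see inside the user's functions, so rewrites like this are impossible; a schema-aware plan makes them routine.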
What is lazy evaluation in Spark or similar engines?
Transformations build a logical plan but are not executed until an action is invoked, allowing optimizations before execution.
What is a transformation vs an action in Spark-style APIs?
Transformations (e.g., map, filter) define new datasets from existing ones; actions (e.g., count, collect) trigger execution and return results.
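The two cards above can be sketched in a few lines of plain Python (the Dataset class is illustrative, not a real Spark API): transformations only record a plan, and an action walks the plan to produce a result.

```python
class Dataset:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan              # recorded transformations, not yet run

    def map(self, fn):                 # transformation: returns a new Dataset
        return Dataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):            # transformation
        return Dataset(self._data, self._plan + (("filter", pred),))

    def collect(self):                 # action: executes the recorded plan
        rows = self._data
        for op, fn in self._plan:
            rows = [fn(r) for r in rows] if op == "map" else [r for r in rows if fn(r)]
        return rows

    def count(self):                   # action
        return len(self.collect())
```

Because nothing runs until `collect` or `count`, the engine sees the full chain of transformations and can optimize it before executing.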
What is shuffle in distributed processing?
A data movement stage where data is redistributed across the cluster, typically to group records by key for joins or aggregations.
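A toy sketch of the shuffle mechanics (function names are illustrative): each mapper hash-partitions its output by key so that every record with the same key lands on the same reducer. On a real cluster this exchange crosses the network.

```python
def shuffle(mapper_outputs, num_reducers):
    # One input list per reducer; mappers route records by key hash.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            # The same key always hashes to the same reducer.
            reducer_inputs[hash(key) % num_reducers].append((key, value))
    return reducer_inputs
```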
Why are shuffles expensive?
They involve network I/O, disk writes, and coordination across nodes, often becoming the bottleneck in distributed jobs.
How can you reduce the cost of shuffles?
Reduce data volume before shuffles, pick join strategies carefully, partition data appropriately, and avoid unnecessary group-by operations.
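One of these mitigations sketched in plain Python: combine values per key locally on each mapper ("map-side aggregation", as Spark's reduceByKey does) before the shuffle, so far less data crosses the network.

```python
from collections import Counter

def map_side_combine(partition):
    # Collapse repeated keys into (key, count) pairs locally,
    # before any records are shuffled.
    return list(Counter(partition).items())
```

Here four input records shrink to two shuffled records; on skewed real data the reduction is often orders of magnitude.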
What is a wide transformation?
A transformation, such as groupByKey, whose output partitions depend on data from many input partitions, and which therefore typically triggers a shuffle.
What is a narrow transformation?
A transformation like map or filter where each output partition depends on data from a single input partition, avoiding shuffles.
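The narrow/wide distinction sketched over explicit partitions (names are illustrative): a narrow transformation touches each partition independently, while a wide one must regroup records across all partitions, and that regrouping is the shuffle.

```python
from collections import defaultdict

def narrow_map(partitions, fn):
    # Narrow: each output partition depends only on one input partition.
    return [[fn(x) for x in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    # Wide: every output partition may need records from every
    # input partition, so records are rerouted by key hash.
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out][key].append(value)
    return [dict(d) for d in out]
```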
What is vectorized execution in query engines?
Processing a batch of values from a single column at a time with tight, CPU-friendly loops, improving cache utilization and throughput.
Why is vectorized execution faster than row-at-a-time processing?
It reduces per-row overhead, uses CPU instructions more efficiently, and benefits from modern CPU pipelines and caching.
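A rough sketch of the contrast in plain Python (real engines use SIMD over contiguous columnar buffers; here only the shape of the computation is shown, and the column and field names are made up):

```python
def row_at_a_time(rows):
    # Row model: every row pays per-record dispatch and lookup overhead.
    return [row["price"] * row["qty"] for row in rows]

def vectorized(price_col, qty_col):
    # Columnar model: one tight loop over two whole columns at once.
    return [p * q for p, q in zip(price_col, qty_col)]
```

Both produce the same result; the columnar form is what lets a compiled engine replace the inner loop with cache-friendly, SIMD-accelerated batch kernels.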