What is the main goal of distributed data processing frameworks?
To process large datasets by splitting work across multiple machines, so that jobs finish faster and can handle data that does not fit on a single node.
What is data parallelism in distributed processing?
Running the same operation on different partitions of the data in parallel across multiple workers.
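A minimal single-machine sketch of data parallelism, with threads standing in for cluster workers (all names here are illustrative, not from any real framework): the same function runs on each partition of the data independently, and the per-partition results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    # The identical operation, applied to every partition.
    return sum(partition)

def parallel_sum(data, num_partitions=4):
    # Split the input into num_partitions strided partitions.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Run partial_sum on every partition in parallel, then combine.
        return sum(pool.map(partial_sum, partitions))
```

In a real engine the partitions live on different machines and the combine step runs on a driver or reducer, but the shape of the computation is the same.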
What is task parallelism in distributed processing?
Running different tasks or stages in parallel, possibly on the same or different data, to increase throughput.
What is the basic idea of MapReduce?
Apply a map function to transform records into key-value pairs, then a reduce function to aggregate values for each key across the cluster.
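A toy, single-process sketch of the MapReduce model for the classic word-count job (the function names are illustrative): map emits key-value pairs, a shuffle step groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_fn(line):
    # Mapper: emit (word, 1) for each word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reducer: aggregate all values seen for one key.
    return key, sum(values)

def mapreduce(lines):
    # "Shuffle": group mapped pairs by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce each group independently (done in parallel on a real cluster).
    return dict(reduce_fn(k, vs) for k, vs in groups.items())
```

On a cluster, the map calls run on many machines at once, the shuffle moves pairs over the network, and each reducer handles a disjoint set of keys.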
Why was MapReduce historically important for big data?
It provided a simple programming model and fault-tolerant execution over large clusters for batch jobs on file systems like HDFS.
What are the main limitations of classic MapReduce?
Disk-heavy execution with repeated writes between stages, limited expressiveness for complex DAGs, and relatively high latency.
What is a DAG (Directed Acyclic Graph) in modern data engines?
A graph of computation stages where nodes are operations and edges represent data dependencies, with no cycles.
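A minimal sketch of a DAG of stages using Python's standard library: nodes are operations, edges are data dependencies, and any topological order is a valid execution schedule. The stage names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
stages = {
    "scan":      set(),
    "filter":    {"scan"},
    "aggregate": {"filter"},
    "join":      {"filter", "scan"},
    "output":    {"aggregate", "join"},
}

# A topological order: every stage appears after all of its dependencies.
order = list(TopologicalSorter(stages).static_order())
```

A real scheduler additionally runs independent stages (here, "aggregate" and "join") concurrently once their inputs are ready.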
Why do engines like Spark use DAGs instead of fixed map/reduce stages?
DAGs allow more complex multi-stage workflows and better global optimization across operations.
What is an execution engine in a distributed system?
The component that schedules tasks, coordinates workers, manages data locality, and handles retries and failures for jobs.
What is data locality and why does it matter?
Processing data close to where it is stored to minimize network I/O and improve performance.
What is Apache Spark conceptually?
A distributed data processing engine that supports batch, streaming, SQL, and ML workloads using DAG-based execution and in-memory computation.
What is an RDD in Spark?
A Resilient Distributed Dataset representing an immutable, partitioned collection of records that can be processed in parallel.
Why did Spark introduce higher-level APIs like DataFrames?
DataFrames provide schema-aware, declarative operations that allow the optimizer to generate more efficient execution plans than RDD code.
What is a DataFrame in Spark-like systems?
A distributed, tabular data structure with named columns and a schema, similar to a table in a relational database.
Why are DataFrames generally preferred over raw RDDs for most workloads?
They enable query planning, optimization, and code generation, often yielding better performance with less boilerplate code.
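A toy sketch of why a declarative plan helps, assuming a made-up plan representation (nested tuples, not any real engine's API): because the engine sees the whole logical plan, it can rewrite it, for example pushing a filter below a projection so later operators see fewer rows.

```python
def push_down_filter(plan):
    # Hypothetical plan nodes: ("scan", table),
    # ("project", columns, child), ("filter", (column, value), child).
    if plan[0] == "filter":
        pred, child = plan[1], plan[2]
        if child[0] == "project" and pred[0] in child[1]:
            # Rewrite Filter(Project(x)) -> Project(Filter(x)):
            # filter the base data first, then project.
            return ("project", child[1], ("filter", pred, child[2]))
    return plan
```

With opaque RDD lambdas the engine cannot see inside the user's functions, so rewrites like this are impossible; a schema-aware plan makes them routine.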
What is lazy evaluation in Spark or similar engines?
Transformations build a logical plan but are not executed until an action is invoked, allowing optimizations before execution.
What is a transformation vs an action in Spark-style APIs?
Transformations (e.g., map, filter) define new datasets from existing ones; actions (e.g., count, collect) trigger execution and return results.
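The two cards above can be sketched in a few lines of plain Python (the Dataset class is illustrative, not a real Spark API): transformations only record a plan, and an action walks the plan to produce a result.

```python
class Dataset:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan              # recorded transformations, not yet run

    def map(self, fn):                 # transformation: returns a new Dataset
        return Dataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):            # transformation
        return Dataset(self._data, self._plan + (("filter", pred),))

    def collect(self):                 # action: executes the recorded plan
        rows = self._data
        for op, fn in self._plan:
            rows = [fn(r) for r in rows] if op == "map" else [r for r in rows if fn(r)]
        return rows

    def count(self):                   # action
        return len(self.collect())
```

Because nothing runs until `collect` or `count`, the engine sees the full chain of transformations and can optimize it before executing.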
What is shuffle in distributed processing?
A data movement stage where data is redistributed across the cluster, typically to group records by key for joins or aggregations.
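A toy sketch of the shuffle mechanics (function names are illustrative): each mapper hash-partitions its output by key so that every record with the same key lands on the same reducer. On a real cluster this exchange crosses the network.

```python
def shuffle(mapper_outputs, num_reducers):
    # One input list per reducer; mappers route records by key hash.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            # The same key always hashes to the same reducer.
            reducer_inputs[hash(key) % num_reducers].append((key, value))
    return reducer_inputs
```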
Why are shuffles expensive?
They involve network I/O, disk writes, and coordination across nodes, often becoming the bottleneck in distributed jobs.
How can you reduce the cost of shuffles?
Reduce data volume before shuffles, pick join strategies carefully, partition data appropriately, and avoid unnecessary group-by operations.
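One of these mitigations sketched in plain Python: combine values per key locally on each mapper ("map-side aggregation", as Spark's reduceByKey does) before the shuffle, so far less data crosses the network.

```python
from collections import Counter

def map_side_combine(partition):
    # Collapse repeated keys into (key, count) pairs locally,
    # before any records are shuffled.
    return list(Counter(partition).items())
```

Here four input records shrink to two shuffled records; on skewed real data the reduction is often orders of magnitude.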
What is a wide transformation?
A transformation, such as groupByKey, whose output partitions depend on data from many input partitions, and which therefore typically triggers a shuffle.
What is a narrow transformation?
A transformation like map or filter where each output partition depends on data from a single input partition, avoiding shuffles.
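The narrow/wide distinction sketched over explicit partitions (names are illustrative): a narrow transformation touches each partition independently, while a wide one must regroup records across all partitions, and that regrouping is the shuffle.

```python
from collections import defaultdict

def narrow_map(partitions, fn):
    # Narrow: each output partition depends only on one input partition.
    return [[fn(x) for x in part] for part in partitions]

def wide_group_by_key(partitions, num_out):
    # Wide: every output partition may need records from every
    # input partition, so records are rerouted by key hash.
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out][key].append(value)
    return [dict(d) for d in out]
```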
What is vectorized execution in query engines?
Processing a batch of values from a single column at a time with tight, CPU-friendly loops, improving cache utilization and throughput.
Why is vectorized execution faster than row-at-a-time processing?
It reduces per-row overhead, uses CPU instructions more efficiently, and benefits from modern CPU pipelines and caching.
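A rough sketch of the contrast in plain Python (real engines use SIMD over contiguous columnar buffers; here only the shape of the computation is shown, and the column and field names are made up):

```python
def row_at_a_time(rows):
    # Row model: every row pays per-record dispatch and lookup overhead.
    return [row["price"] * row["qty"] for row in rows]

def vectorized(price_col, qty_col):
    # Columnar model: one tight loop over two whole columns at once.
    return [p * q for p, q in zip(price_col, qty_col)]
```

Both produce the same result; the columnar form is what lets a compiled engine replace the inner loop with cache-friendly, SIMD-accelerated batch kernels.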