Developer Experience & Tooling Flashcards by O Cam

What is the main goal of developer experience (DX) in data engineering?

To make it easy, fast, and safe for engineers to build, test, deploy, and maintain data pipelines and models.

Why is version control a core tool for data engineering?

It tracks changes to code, configs, and sometimes schemas, enabling collaboration, rollback, and auditability.

What systems are commonly used for version control?

Git, typically hosted on platforms such as GitHub, GitLab, or Bitbucket.

Why should SQL and configuration files be stored in version control?

SQL and configs define core pipeline logic; versioning them enables code review, history, and reproducible deployments.

What is trunk-based development?

A workflow where developers integrate small, frequent changes into a main branch, reducing long-lived feature branches and merge pain.

Why are small, frequent commits preferred over large infrequent ones?

They are easier to review, test, debug, and roll back if something goes wrong.

What is continuous integration (CI)?

An automated process that builds and tests code whenever changes are committed or merged, catching issues early.

What types of tests are common in CI for data engineering?

Unit tests for transformation logic, schema checks, linting, and sometimes small integration tests against test databases.
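
As an illustration of the schema checks mentioned above, here is a minimal sketch of a check that a CI job could run against sample rows. The column names, the `EXPECTED_SCHEMA` mapping, and the `schema_errors` helper are all hypothetical, chosen only for this example.

```python
# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}


def schema_errors(rows, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable schema violations (empty if none)."""
    errors = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        extra = row.keys() - expected.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        if extra:
            errors.append(f"row {i}: unexpected columns {sorted(extra)}")
        # Type-check only the columns that are actually present.
        for col, typ in expected.items():
            if col in row and not isinstance(row[col], typ):
                errors.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {typ.__name__}"
                )
    return errors


good = [{"order_id": 1, "amount": 9.5, "currency": "EUR"}]
bad = [{"order_id": "1", "amount": 9.5}]  # wrong type, missing column
```

A CI step could fail the build whenever `schema_errors(sample_rows)` returns a non-empty list.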

What is continuous delivery (CD)?

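Automatically preparing every change that passes CI for release, so that deploying to an environment takes minimal manual steps. When the production deployment itself is also automated, the practice is usually called continuous deployment.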

Why is CI/CD useful for data pipelines?

It reduces manual deployment errors, speeds up iteration, and ensures changes are tested before reaching production.

What is Infrastructure as Code (IaC)?

Managing infrastructure resources using versioned configuration files rather than manual ad-hoc changes.

Why is IaC important in data platforms?

It makes environments reproducible, auditable, and consistent across dev, test, and prod, reducing configuration drift.

What are common IaC tools used with data platforms?

Terraform, CloudFormation, Pulumi, and similar systems.

What is configuration drift?

The gradual divergence of environments from their documented or intended configuration because of manual changes or inconsistent updates.

How does IaC help prevent configuration drift?

By applying declarative definitions repeatedly, IaC tools reconcile actual state with desired state and highlight drift.
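
The reconcile idea can be sketched in a few lines. This is a toy model, not how any particular IaC tool is implemented: resources are plain dictionaries, and the `plan` and `reconcile` functions and the resource names are hypothetical.

```python
def plan(desired, actual):
    """Compare desired and actual resource settings and list the changes needed."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))
        elif name not in desired:
            changes.append(("delete", name))
        elif desired[name] != actual[name]:
            changes.append(("update", name))
    return changes


def reconcile(desired, actual):
    """Apply the planned changes so that actual matches desired."""
    for action, name in plan(desired, actual):
        if action == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return actual


# Desired state lives in version control; actual state has drifted.
desired = {
    "raw_bucket": {"versioning": True},
    "warehouse": {"size": "small"},
}
actual = {
    "raw_bucket": {"versioning": False},   # changed by hand in a console
    "warehouse": {"size": "small"},
    "tmp_bucket": {"versioning": False},   # created by hand, never declared
}
drift = plan(desired, actual)
```

Running `plan` before any change is applied is what surfaces drift; running `reconcile` repeatedly is safe because a second pass finds nothing left to change.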

What is environment parity?

The degree to which development, test, and production environments are similar in configuration and behavior.

Why is environment parity important for data engineering?

Bugs that only appear in production are harder to catch; parity ensures tests more closely reflect real conditions.

What is a feature flag in data systems?

A configuration switch that enables or disables features or pipeline paths without redeploying code.
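
A minimal sketch of such a switch, assuming the flag is read from an environment variable. The flag name `USE_DEDUPE_V2` and both dedupe functions are hypothetical; real systems often use a config file or a flag service instead.

```python
import os


def flag_enabled(name, env=os.environ):
    """Read a boolean flag; anything other than a truthy value counts as off."""
    return env.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def dedupe_v1(rows):
    # Existing behaviour: keep the first row seen for each id.
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out


def dedupe_v2(rows):
    # New behaviour behind the flag: keep the latest row for each id.
    latest = {}
    for row in rows:
        latest[row["id"]] = row
    return list(latest.values())


def run_step(rows, env=os.environ):
    """Pick the pipeline path at run time, without redeploying code."""
    step = dedupe_v2 if flag_enabled("USE_DEDUPE_V2", env) else dedupe_v1
    return step(rows)
```

Because the flag defaults to off, an unset or misspelled variable falls back to the existing path, which is usually the safer failure mode.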

Why are feature flags useful for pipeline changes?

They allow controlled rollout, quick rollback, and canary testing of new logic with minimal disruption.

What is a code review and why is it important for DE?

A peer review of changes before merging, used to catch bugs, enforce standards, and share knowledge across the team.

What are common code review focus areas in DE?

Correctness, performance, data quality, security, readability, and consistency with modeling and naming conventions.

What is linting in data engineering codebases?

Static analysis that checks code for style, errors, or anti-patterns, often integrated into CI pipelines.

Why should SQL also be linted or validated?

Consistent formatting and basic static checks help readability, prevent simple mistakes, and support automated tooling.

What is unit testing in data pipelines?

Testing individual functions or transformations with small, deterministic inputs and expected outputs.
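
For example, a unit test for a single transformation might look like the sketch below. The `normalize_order` function and its fields are hypothetical; the test is written with plain assertions so that it also runs under a test runner such as pytest.

```python
def normalize_order(row):
    """Hypothetical transformation: clean the currency code and flag refunds."""
    return {
        **row,
        "currency": row["currency"].strip().upper(),
        "is_refund": row["amount"] < 0,
    }


def test_normalize_order():
    # Small, deterministic input with a known expected output.
    row = {"order_id": 7, "currency": " eur ", "amount": -5.0}
    result = normalize_order(row)
    assert result == {
        "order_id": 7,
        "currency": "EUR",
        "amount": -5.0,
        "is_refund": True,
    }
    # The transformation should not mutate its input.
    assert row == {"order_id": 7, "currency": " eur ", "amount": -5.0}


test_normalize_order()
```

Keeping the transformation a pure function of its input row is what makes this kind of test cheap: no database, scheduler, or network is needed.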

What is integration testing in data pipelines?

Testing how multiple components work together, such as end-to-end runs against a test database or storage layer.

Why is test data design critical for DE tests?

Realistic, targeted test data helps reveal edge cases, regression bugs, and performance issues that trivial synthetic data may miss.

What is a staging environment?

A pre-production environment that closely mirrors production, used for final testing before deployment.

Why is it dangerous to test changes directly in production?

Errors can corrupt real data, break downstream users, and violate SLAs or compliance requirements.

What is blue/green deployment in the context of DE systems?

Running old and new versions of pipelines or services in parallel, switching traffic from blue to green once the new version is validated.
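
One way to picture the switch for a table-producing pipeline is two table slots plus a pointer that readers follow. The sketch below is an in-memory stand-in under that assumption; the `BlueGreenTables` class is hypothetical, and in a warehouse the pointer might be a view or an alias instead.

```python
class BlueGreenTables:
    """Hypothetical stand-in for two table versions and a 'live' pointer."""

    def __init__(self, blue_rows):
        self.slots = {"blue": list(blue_rows), "green": []}
        self.live = "blue"

    def idle(self):
        """Name of the slot that readers are not currently using."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, build_rows, validate):
        """Build into the idle slot; switch the pointer only if validation passes."""
        target = self.idle()
        self.slots[target] = build_rows()
        if not validate(self.slots[target]):
            return False  # live slot untouched; readers never see the bad build
        self.live = target
        return True

    def read(self):
        return self.slots[self.live]


def not_empty(rows):
    # Deliberately simple validation rule for the example.
    return len(rows) > 0
```

Rolling back is the same pointer flip in the other direction, as long as the previous slot has not yet been overwritten by a later build.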

What is canary deployment for data changes?

Rolling out a change to a small subset of data, users, or workloads first to detect issues before full rollout.
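
A common way to pick the subset is a stable hash of some key, so the same records land in the canary group on every run. The following is a sketch under that assumption; `in_canary`, `route`, and the `customer_id` key are hypothetical names.

```python
import hashlib


def in_canary(key, percent):
    """Deterministically place roughly `percent`% of keys in the canary group."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


def route(rows, percent, old_logic, new_logic, key="customer_id"):
    """Apply new_logic to canary rows and old_logic to everything else."""
    return [
        (new_logic if in_canary(row[key], percent) else old_logic)(row)
        for row in rows
    ]
```

A cryptographic hash is used here rather than Python's built-in `hash`, because the built-in is randomized per process for strings and would reshuffle the canary group on every run.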

Why is it important to integrate observability into developer workflows?

Developers need immediate feedback on how code behaves in real environments to iteratively improve reliability and performance.

What is a runbook?

A documented set of steps for handling common operational scenarios, such as failures, rollbacks, and incident response.

Why should runbooks be kept with the code repository?

Versioned runbooks ensure operational procedures evolve alongside code and remain accessible to the team.

What is a local development environment for DE?

A setup on a developer’s machine or container that mimics key aspects of the production stack for development and testing.

Why can local development be challenging in DE compared to pure app dev?

It may involve large datasets, multiple external systems, and complex security constraints that are hard to fully replicate.

What strategies help with local DE development?

Using smaller test datasets, mocks or stubs for external systems, and containerized services where feasible.
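
As a sketch of the stub approach, the loader below depends only on an object with an `insert_rows` method, so a small in-memory class can stand in for a real warehouse client during local runs. Both `InMemoryWarehouse` and the `insert_rows` method name are hypothetical, not the API of any specific client library.

```python
class InMemoryWarehouse:
    """Stub exposing the same hypothetical `insert_rows` method as a real client."""

    def __init__(self):
        self.tables = {}

    def insert_rows(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)
        return len(rows)


def load_orders(client, rows):
    """Drop rows without an order_id, then write the rest via the client."""
    valid = [r for r in rows if r.get("order_id") is not None]
    return client.insert_rows("orders", valid)


# Small local dataset instead of a production-sized extract.
sample = [{"order_id": 1}, {"order_id": None}, {"order_id": 2}]
stub = InMemoryWarehouse()
written = load_orders(stub, sample)
```

Passing the client in as a parameter, rather than constructing it inside `load_orders`, is the design choice that makes the swap possible.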

What is a data engineer’s CLI or toolkit typically used for?

Running tasks, managing schemas, debugging pipelines, and interacting with storage or warehouses in a scripted, repeatable way.

Why is automation preferred over manual console work for recurring DE tasks?

Automation reduces errors, improves repeatability, and can be integrated into CI/CD and scheduling systems.

What is technical debt in data engineering code and pipelines?

Shortcuts or suboptimal designs taken to ship faster that make future changes harder, riskier, or more expensive.

Why is refactoring important over time in DE systems?

Refactoring improves structure, reduces duplication, and makes pipelines easier to maintain and extend as requirements change.

What is the difference between a monorepo and a multi-repo approach for DE code?

A monorepo stores many services and pipelines in one repository; multi-repo uses separate repositories per service or domain.

What is a good one-sentence mental model for DE developer experience and tooling?

Treat data pipelines like software: version everything, test and review changes, deploy with CI/CD and IaC, and give engineers fast, reliable feedback loops.

(42 cards)