What is Cloud Dataflow?
What are some Use Cases for Cloud Dataflow?
What are the advantages to using Cloud Dataflow?
Cloud Dataflow is particularly well suited to out-of-order processing and real-time session management.
1) Accelerates development for batch and streaming: provides simple pipeline development via the Java and Python APIs in the Apache Beam SDK, plus code reuse.
2) Simplifies operations and management: the serverless approach removes operational overhead.
3) Integrates with Stackdriver for a unified logging, monitoring, and alerting solution.
4) Builds a foundation for machine learning by letting you add ML models and APIs to your data processing pipelines.
5) Integrates seamlessly with GCP services for streaming event ingestion (Pub/Sub), data warehousing (BigQuery), machine learning (Cloud Machine Learning), and more. The Beam-based SDK also lets customers run pipelines on alternative execution engines such as Apache Spark, via Cloud Dataproc or on-premises.
What are features of Cloud Dataflow?
What are the different types of Triggers provided by Dataflow?
What are the two basic primitives in Dataflow?
What does Triggering provide for Windowing and Watermarks?
What are some differences between Dataflow and Spark?
What are the four major concepts when you think about data processing with Dataflow?
What is a Pipeline?
A pipeline encapsulates an entire series of computations that accepts some input data from external sources, transforms that data to provide some useful intelligence, and produces some output data. That output data is often written to an external data sink. The input source and output sink can be the same, or they can be of different types, allowing you to easily convert data from one format to another.
Each pipeline represents a single, potentially repeatable job, from start to finish, in the Dataflow service.
What is a PCollection?
A PCollection represents a set of data in your pipeline. The Dataflow PCollection classes are specialized container classes that can represent data sets of virtually unlimited size. A PCollection can hold a data set of a fixed size (such as data from a text file or a BigQuery table), or an unbounded data set from a continuously updating data source (such as a subscription from Google Cloud Pub/Sub).
PCollections are the inputs and outputs for each step in your pipeline.
PCollections are optimized for parallelism, unlike the standard JDK Collection class.
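As a sketch of the concepts above, using the Apache Beam 2.x Java SDK (the fruit values are made up for illustration, and the commented-out file and Pub/Sub paths are hypothetical), a bounded PCollection can be built from in-memory data or read from an external source:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class PCollectionExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // A bounded PCollection built from a fixed, in-memory data set.
    PCollection<String> fruits =
        p.apply("CreateInMemory", Create.of(Arrays.asList("apple", "banana", "cherry")));

    // A bounded PCollection can also be read from files, e.g.:
    //   p.apply(TextIO.read().from("gs://my-bucket/input*.txt"));
    // while an unbounded one could come from a Pub/Sub subscription, e.g.:
    //   p.apply(PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"));

    p.run().waitUntilFinish();
  }
}
```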
What is a Transform?
A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.
Your transforms don’t need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.
In a Dataflow pipeline, a transform represents a step, or a processing operation that transforms data. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.
Transforms in the Dataflow model can be nested—that is, transforms can contain and invoke other transforms, thus forming composite transforms.
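For example, the ReverseWords transform used in the code example later in this document could be written as a composite transform that wraps a ParDo. This is a sketch against the Apache Beam 2.x Java SDK (the notes above target the older Dataflow SDK, so treat the exact class names as an assumption):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// A composite transform: it contains and invokes another transform (a ParDo).
public class ReverseWords
    extends PTransform<PCollection<String>, PCollection<String>> {

  // The per-element processing function applied by the inner ParDo.
  static class ReverseFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String word, OutputReceiver<String> out) {
      out.output(new StringBuilder(word).reverse().toString());
    }
  }

  @Override
  public PCollection<String> expand(PCollection<String> words) {
    return words.apply("ReverseEachWord", ParDo.of(new ReverseFn()));
  }
}
```

Callers then apply it like any other step: `words.apply(new ReverseWords())`.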
What are I/O Sources and Sinks?
The Dataflow SDKs provide data source and data sink APIs for pipeline I/O. You use the source APIs to read data into your pipeline, and the sink APIs to write output data from your pipeline. These source and sink operations represent the roots and endpoints of your pipeline.
The Dataflow source and sink APIs let your pipeline work with data from a number of different data storage formats, such as files in Google Cloud Storage, BigQuery tables, and more. You can also use a custom data source (or sink) by teaching Dataflow how to read from (or write to) it in parallel.
What are the Dataflow SDK options and advantages?
For SDK 2.x:
Languages: Java and Python
Automatic patches and updates by Google
Issue and new version communications
Tested by Google
Eclipse Plugin Support
What questions should you consider when designing your pipeline?
What is the basic design behind a Pipeline?
A simple pipeline is linear: Input > Transform(PCollection) > Output
But a pipeline can be much more complex: it represents a directed acyclic graph (DAG) of steps. It can have multiple input sources, multiple output sinks, and its operations (transforms) can output multiple PCollections.
It’s important to understand that transforms do not consume PCollections; instead, they consider each individual element of a PCollection and create a new PCollection as output. This way, you can do different things to different elements in the same PCollection.
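A minimal sketch of this branching behavior (Beam 2.x Java SDK; the names and filter predicates are invented for illustration), where two transforms read from the same PCollection:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

public class BranchingExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    PCollection<String> names = p.apply(Create.of("Alice", "Bob", "Anna"));

    // Both branches read from the same PCollection; neither consumes it.
    PCollection<String> aNames =
        names.apply("StartsWithA", Filter.by((String s) -> s.startsWith("A")));
    PCollection<String> bNames =
        names.apply("StartsWithB", Filter.by((String s) -> s.startsWith("B")));

    p.run().waitUntilFinish();
  }
}
```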
What are the parts of a Pipeline?
A pipeline consists of two parts:
How does Dataflow perform data encoding?
When you create or output pipeline data, you’ll need to specify how the elements in your PCollections are encoded and decoded to and from byte strings. Byte strings are used for intermediate storage, as well as when reading from sources and writing to sinks. The Dataflow SDKs use objects called coders to describe how the elements of a given PCollection should be encoded and decoded.
You typically need to specify a coder when reading data into your pipeline from an external source (or creating pipeline data from local data), and also when you output pipeline data to an external sink.
You set the coder by calling the method .withCoder when you apply your pipeline’s Read or Write transform.
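A sketch of specifying a coder when creating pipeline data from local data, using the Beam 2.x Java SDK (MyRecord is a made-up class here; SerializableCoder is a real Beam coder for Serializable types):

```java
import java.io.Serializable;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CoderExample {
  // A hypothetical record type; Serializable so SerializableCoder can encode it.
  public static class MyRecord implements Serializable {
    public final String name;
    public MyRecord(String name) { this.name = name; }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // Explicitly tell the runner how MyRecord elements are encoded to byte strings.
    PCollection<MyRecord> records =
        p.apply(Create.of(new MyRecord("a"), new MyRecord("b"))
            .withCoder(SerializableCoder.of(MyRecord.class)));

    p.run().waitUntilFinish();
  }
}
```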
What are the PCollection limitations?
What is meant by bounded and unbounded PCollections, and what data-sources/sinks create and accept them?
What is Dataflow’s default Windowing behavior?
Dataflow’s default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must set a non-global windowing function.
If you don’t set a non-global windowing function for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your Dataflow job will fail.
You can alternatively set a non-default Trigger for a PCollection to allow the global window to emit “early” results under some other conditions. Triggers can also be used to allow for conditions such as late-arriving data.
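A sketch of setting a non-global windowing function with a trigger that also fires for late-arriving data (Beam 2.x Java SDK; the one-minute window and ten-minute allowed lateness are arbitrary choices for illustration):

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingExample {
  // Assign elements to one-minute windows before any GroupByKey/Combine.
  static PCollection<String> windowByMinute(PCollection<String> input) {
    return input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Emit a result when the watermark passes the end of the window...
            .triggering(AfterWatermark.pastEndOfWindow()
                // ...and again for each late element that arrives afterwards.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(10))
            .accumulatingFiredPanes());
  }
}
```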
What I/O API’s (read and write transforms) are available in the Dataflow SDK’s?
What security does Dataflow provide to keep your data secure and private?
How is a Dataflow Pipeline constructed? What are the general steps?
————– code example ———————————————
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();

// Then create the pipeline.
Pipeline p = Pipeline.create(options);

// Read the data.
PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadMyFile").from("gs://some/inputData.txt"));

// Apply transforms.
PCollection<String> words = ...;
PCollection<String> reversedWords = words.apply(new ReverseWords());

// Write out final pipeline data.
PCollection<String> filteredWords = ...;
filteredWords.apply(TextIO.Write.named("WriteMyFile").to("gs://some/outputData.txt"));

// Run the pipeline.
p.run();