Model engineering Flashcards

Learning about model engineering (31 cards)

1
Q

What is CRISP-DM?

A

Cross-Industry Standard Process for Data Mining

2
Q

Why is it important to monitor model performance after deployment in model engineering?

A

It is important to track and evaluate model performance against dynamic data and to ensure the model remains relevant.

Real-world data is dynamic and changes over time.

Monitoring also helps to identify model performance decay and gives the data science team early notice to rebuild the model.
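
A minimal sketch of decay detection, assuming accuracy is logged per period and compared with the accuracy measured at deployment. The function name, the period keys, and the 0.05 tolerance are illustrative assumptions, not part of any specific monitoring tool.

```python
def detect_performance_decay(accuracy_by_period, baseline, tolerance=0.05):
    """Return the periods whose accuracy fell more than `tolerance` below baseline."""
    return [
        period
        for period, accuracy in accuracy_by_period.items()
        if baseline - accuracy > tolerance
    ]


# Hypothetical monitoring log: accuracy measured on fresh labelled data each month.
monthly_accuracy = {"2024-01": 0.91, "2024-02": 0.89, "2024-03": 0.82}
flagged = detect_performance_decay(monthly_accuracy, baseline=0.90)
print(flagged)  # ['2024-03']
```

A flagged period would be the trigger for notifying the team; the right tolerance depends on the use case.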

3
Q

What does five-number summary refer to in the context of data ingestion and EDA?

A
  • minimum
  • lower quartile (Q1)
  • median
  • upper quartile (Q3)
  • maximum

for numeric data
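
The five values can be computed with the Python standard library, as in the sketch below. The function name is an assumption, and quartile conventions differ between tools; this version uses the "inclusive" method of `statistics.quantiles`.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # method="inclusive" treats the data as the whole population, so the
    # quartiles always lie between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
# (1, 3.0, 5.0, 7.0, 9)
```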

4
Q

Please explain what makes TDSP an Agile-based methodology.

A

It is an iterative, adaptive, and incremental process that allows the data science team to loop back to previous steps at any stage of the project.

5
Q

What are benchmark models used for in the context of model engineering?

A

A baseline reference point for showing stakeholders whether additional modeling effort adds value and whether the newly built model is worth deploying.
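
As an illustration, one common benchmark for classification is a model that always predicts the majority class. The sketch below compares such a baseline with a candidate model; the labels and the candidate's predictions are made-up data, and the helper names are assumptions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class_baseline(y_train, n_predictions):
    """Benchmark model: always predict the most frequent training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_predictions


# Made-up labels and predictions, for illustration only.
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0, 1]
candidate_predictions = [0, 1, 0, 0, 0]

baseline_accuracy = accuracy(y_test, majority_class_baseline(y_train, len(y_test)))
candidate_accuracy = accuracy(y_test, candidate_predictions)
print(baseline_accuracy, candidate_accuracy)  # 0.6 0.8
```

The candidate is only worth further effort if it clearly beats the benchmark on held-out data.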

6
Q

What does HLP stand for in the context of model engineering?

A

Human Level Performance

the accuracy or error rate achieved when the task is performed by humans

7
Q

What is TFX in the context of MLOps?

A

TensorFlow Extended

8
Q

What does TFT stand for in the context of MLOps?

A

TensorFlow Transform

It provides data preprocessing components that can be embedded into a TFX project.

9
Q

What does TFMA stand for in the context of MLOps?

A

TensorFlow Model Analysis

10
Q

What are the two main types of deployment strategies when implementing scoring models through API serving?

A

batch scoring: model prediction and inference executed for large numbers of data points, triggered by a schedule or a specific event

real-time scoring: predictions served almost instantly after requests arrive from clients

11
Q

What does DVC stand for in the context of model engineering?

A

Data Version Control

12
Q

List examples of what is included in model metadata.

A

model name, version ID, model registry location, and model input and output directories
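
A hypothetical metadata record could be modeled as below. The class, field names, and values are illustrative assumptions and do not follow the schema of any particular model registry.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelMetadata:
    """Illustrative metadata record; all field names are assumptions."""
    name: str
    version_id: str
    registry_location: str
    input_dir: str
    output_dir: str


# Hypothetical values, for illustration only.
metadata = ModelMetadata(
    name="churn-classifier",
    version_id="1.3.0",
    registry_location="registry/churn-classifier/1.3.0",
    input_dir="data/churn/features",
    output_dir="data/churn/predictions",
)

# Serialize so the record can be stored next to the model artifact.
record = json.dumps(asdict(metadata), indent=2)
print(record)
```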

13
Q

Please explain why model versioning is important to the model building process.

A

It ensures that changes to models are tracked properly and that previously generated models are reproducible. It also makes team collaboration more efficient and supports continuous model experimentation and improvement.

14
Q

What does Eclat/ECLAT stand for in the context of frequent itemset mining (association rule mining)?

A
  • Equivalence Class Transformation
  • Equivalence Class Clustering and bottom-up Lattice Traversal

Both expansions appear in the literature.

A depth-first search algorithm that uses vertical data format and set intersection to mine frequent itemsets efficiently.
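
The vertical format and set intersection can be sketched in a few lines. This is a simplified illustration rather than a reference implementation; the function name and the toy transactions are assumptions.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    # Vertical format: map each item to the set of transaction ids containing it.
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that can extend the current prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Depth-first step: intersect tidsets with the remaining items.
            deeper = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    deeper.append((other, shared))
            if deeper:
                extend(itemset, deeper)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(set(), start)
    return frequent


# Toy transactions, for illustration only.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
result = eclat(baskets, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

The support of an itemset is simply the size of the intersected transaction-id set, so the transactions are never rescanned.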

15
Q

What does FP stand for in the FP-growth algorithm?

A

frequent pattern

In the first pass, the algorithm counts item frequencies and stores them in a header table. In the second pass, it builds a compressed FP-tree (a prefix tree) by inserting transactions in descending frequency order. Frequent itemsets are then mined by recursively constructing conditional FP-trees — without candidate generation (unlike Apriori). Proposed by Han, Pei & Yin (2000).
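
The two passes described above can be sketched as follows. Only the counting and tree-construction steps are shown; the recursive mining of conditional FP-trees is omitted. The class and function names, the alphabetical tie-breaking rule, and the toy transactions are assumptions.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # Pass 1: count item frequencies and keep only the frequent items.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    header = {item: [] for item in frequent}  # item -> tree nodes holding it

    # Pass 2: insert each transaction with its items in descending frequency
    # order (ties broken alphabetically), sharing common prefixes.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, frequent


# Toy transactions, for illustration only.
transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
root, header, frequent = build_fp_tree(transactions, min_support=2)
print(root.children["b"].count)  # 5: every transaction shares the prefix "b"
```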

16
Q

What does SARSA stand for in the context of reinforcement learning?

A

state–action–reward–state–action

an algorithm for learning a Markov decision process policy
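
The name spells out the quantities used in each update: the current state and action, the reward, and the next state and action. A minimal sketch of one update on a tabular Q-function follows; the dictionary-based table, the state and action labels, and the default step size and discount factor are assumptions.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """Apply one SARSA update to the Q-table `q` in place; return the new value."""
    current = q.get((state, action), 0.0)
    # On-policy target: uses the action actually chosen in next_state,
    # not the greedy maximum (which would make this Q-learning).
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = current + alpha * (target - current)
    return q[(state, action)]


# Hypothetical two-state example.
q_table = {("s1", "right"): 2.0}
new_value = sarsa_update(q_table, "s0", "right", 1.0, "s1", "right",
                         alpha=0.5, gamma=0.9)
print(round(new_value, 6))  # 1.4
```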

17
Q

Please explain how overfitting and underfitting are related to the concepts of bias and variance.

A

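Overfitting produces a high-variance, low-bias model: it fits the training data closely (low training error) but generalizes poorly (high test error). Underfitting produces a high-bias, low-variance model: it is too simple to capture the underlying pattern, so error is high on both training and test data.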

18
Q

What does LOOCV stand for in model engineering?

A

Leave One Out Cross Validation

19
Q

Please explain the advantages and disadvantages of using cross validation.

A

Cross validation gives a more reliable estimate of model performance because it repeats training and testing on different portions of the sample data. Its main disadvantage is computational cost: the model is trained once per fold, which can be time-consuming.

20
Q

Which cross validation technique works best with classification models with unbalanced labels?

A

stratified K-fold cross validation
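
The idea is to split each class separately so that every fold keeps roughly the same label proportions as the full dataset. A simplified sketch, without shuffling, is shown below; the function name and the toy labels are assumptions.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(k)]
    # Deal the indices of each class round-robin across the folds.
    for indices in indices_by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Unbalanced toy labels: nine negatives and three positives.
labels = [0] * 9 + [1] * 3
folds = stratified_k_fold(labels, k=3)
for fold in folds:
    print(fold, "positives:", sum(labels[i] for i in fold))
```

Each of the three folds ends up with exactly one positive, whereas a plain split of the ordered data would put all positives in the last fold.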

21
Q

How many iterations of model testing are involved in a K-Fold Cross Validation on a dataset with N observations?

A

K iterations, regardless of N: each of the K folds serves as the test set exactly once.
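
A small sketch makes the iteration count concrete: the number of train/test rounds equals the number of folds K, and setting K = N gives leave-one-out cross validation. The generator name is an assumption and no shuffling is done.

```python
def k_fold_splits(n_observations, k):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    indices = list(range(n_observations))
    # The first n % k folds receive one extra observation.
    fold_sizes = [n_observations // k + (1 if i < n_observations % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 5))
loocv_splits = list(k_fold_splits(10, 10))
print(len(splits), len(loocv_splits))  # 5 10
```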

22
Q

Please explain the reasons for building interpretable models.

A

Interpretable models make it easier to debug and reason about model predictions. They make communication with stakeholders who have different requirements more effective. More importantly, they help build trust with end users and decision makers by making the model’s internal process transparent and visible.

23
Q

What does CAM stand for in the context of CNNs?

A

class activation map

24
Q

What does LIME stand for in the context of surrogate models in model engineering?

A

local interpretable model-agnostic explanations

25
Q

What was the Shapley value originally used to calculate?

A

the contribution of each player to the result of a cooperative game
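
For a small game, the Shapley value can be computed exactly by averaging each player's marginal contribution over all orderings of the players. The sketch below does this for a made-up two-player game; the function name and payoffs are assumptions.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for player in ordering:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {p: total / len(orderings) for p, total in totals.items()}


# Made-up cooperative game: the payoff of every possible coalition.
payoffs = {
    frozenset(): 0,
    frozenset({"A"}): 10,
    frozenset({"B"}): 20,
    frozenset({"A", "B"}): 50,
}
shares = shapley_values(["A", "B"], lambda coalition: payoffs[coalition])
print(shares)  # {'A': 20.0, 'B': 30.0}
```

The shares add up to the payoff of the full coalition (50). Enumerating all orderings grows factorially with the number of players, so this only works for small games.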
26
Q

What does a surrogate model do in the context of model engineering?

A

It approximates the behavior of a black-box model by training an interpretable model on the black-box model’s predictions.
27
Q

Please explain the difference between global surrogate models and local surrogate models.

A

A global surrogate model provides a summarized understanding of a black-box model’s overall behavior: it is trained to approximate the black-box model’s predictions across the whole dataset. A local surrogate model, in contrast, focuses on the prediction for an individual case and its surrounding region, and only needs to be locally faithful to the black-box model.
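
A toy illustration of the difference, assuming a one-dimensional black box and a straight line as the interpretable surrogate. The black-box function, the sampling ranges, and the helper names are all assumptions; practical local methods such as LIME also perturb inputs and weight samples by proximity, which is left out here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def black_box(x):
    # Hypothetical stand-in for an opaque model.
    return x ** 2


# Global surrogate: fit on black-box predictions over the whole range [0, 4].
global_xs = [i / 10 for i in range(41)]
global_slope, _ = fit_line(global_xs, [black_box(x) for x in global_xs])

# Local surrogate: fit only on black-box predictions near the case x = 1.
local_xs = [1 + i / 100 for i in range(-10, 11)]
local_slope, _ = fit_line(local_xs, [black_box(x) for x in local_xs])

print(round(global_slope, 3), round(local_slope, 3))  # 4.0 2.0
```

The global line summarizes the average trend, while the local line matches the black box's behavior around x = 1 and says little about other regions.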
28
Q

Which pooling layer is used for class activation map?

A

Global average pooling
29
Q

Please explain what bagging is and provide an example of an algorithm using bagging.

A

Bagging is a type of **ensemble model** that uses the **bootstrapping** technique (sampling with replacement) to create random samples of the dataset for training multiple models **in parallel**. Predictions from these models are **aggregated** (by voting or averaging) to generate the final prediction. Random forest is a widely used example of a bagging algorithm. It uses decision trees as the base learners and considers a random subset of features at each split.
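
A minimal sketch of bootstrapping and aggregation, using one-dimensional threshold classifiers as base learners instead of full decision trees. The helper names, the toy data, and the assumption that class 1 lies to the right of the threshold are illustrative; the models are trained in a loop here, but they are independent and could be trained in parallel.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    labels = {y for _, y in points}
    if len(labels) == 1:
        # Degenerate bootstrap sample with a single class: always predict it.
        only = labels.pop()
        return lambda x: only
    xs = sorted({x for x, _ in points})
    best_threshold, best_errors = None, None
    for left, right in zip(xs, xs[1:]):
        threshold = (left + right) / 2
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return lambda x: int(x > best_threshold)


def bagging_fit(points, n_models, rng):
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) samples with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    # Aggregate by majority vote.
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: class 0 for x in 0..4, class 1 for x in 6..10.
data = [(x, 0) for x in range(5)] + [(x, 1) for x in range(6, 11)]
models = bagging_fit(data, n_models=25, rng=random.Random(0))
print(bagging_predict(models, 1), bagging_predict(models, 9))
```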
30
Q

Please explain the difference between XGBoost and gradient boosting.

A

XGBoost is an implementation of the gradient boosting algorithm that scales to larger datasets and adds more advanced regularization (L1 and L2 penalties on leaf weights). Trees are still built sequentially, as in any gradient boosting method, but XGBoost speeds up training by parallelizing the search for splits within each tree.
31
Q

Please explain how stacking is different from bagging and boosting.

A

Bagging and boosting typically train multiple learners with a single algorithm, whereas stacking combines a diverse range of algorithms as first-level learners. Stacking also trains a meta-learner on the predictions of the first-level learners to combine them into the final prediction.