Test 1 W1-4 Flashcards

Tuesday, October 2

1
Q

What is Deep Learning?

A

A set of machine learning algorithms that model high-level abstractions in data by using model architectures (often neural networks)

2
Q

What is Machine Learning?

A

Let machines learn from data instead of writing programs by hand.

Provide examples that specify the correct output for a given input; a machine learning algorithm takes the data and produces a program/model that does the job (if done right, the program works for new cases as well, and you can change the program by training it on new data).

3
Q

Given examples of inputs and corresponding desired outputs, predict outputs on future inputs

A

Supervised Learning

4
Q

Given only inputs, automatically discover representations, features, structures, etc

A

Unsupervised Learning

5
Q

Given sequences of inputs, actions from a fixed set, and scalar rewards/ punishments, learn to select sequences in a way that maximizes expected reward

A

Reinforcement Learning

6
Q

Outputs are categorical; inputs are anything. Goal is to select correct class for new inputs.

A

Classification (Supervised Learning)

7
Q

Outputs are continuous; inputs are anything (but usually continuous values). Goal is to predict outputs accurately for new inputs.

A

Regression (Supervised Learning)

8
Q

-?- is used to scale the range of a feature

A

Feature Normalization

(need to normalize features so that the distance between two points is not governed by one particular feature with a broad range of values)
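A minimal sketch of min-max feature normalization in NumPy (the function name and example values are illustrative, not from the course):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Without scaling, the second feature (broad range) would dominate distance calculations.
X = np.array([[25.0, 40_000.0],
              [32.0, 120_000.0],
              [47.0, 60_000.0]])
print(min_max_normalize(X))
```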

9
Q

Many classifiers calculate the distance between two points by -?-

A

The Euclidean distance
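For reference, a small NumPy sketch of the Euclidean distance between two points (illustrative only):

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```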

10
Q

-?- is often used as the method to find the optimal solution, and it converges much faster with feature scaling than without it

A

gradient descent
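A minimal gradient descent sketch for intuition, assuming a linear model fit by minimizing mean squared error (the learning rate, step count, and toy data are illustrative):

```python
import numpy as np

def gradient_descent(X, t, lr=0.1, steps=100):
    """Fit weights w for predictions X @ w by minimizing mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - t) / len(t)  # gradient of the mean squared error
        w -= lr * grad
    return w

# Toy data: a bias column plus one feature; targets follow t = 1 + 2 * x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
print(gradient_descent(X, t))  # approaches [1, 2]
```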

11
Q

What are the three phases of classification tasks?

A

Data is split into:

  1. Training set
  2. Validation (aka development or held-out) set (used for model selection/averaging)
  3. Test set

Ratio often 60:20:20
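A sketch of a 60:20:20 split in plain NumPy (shuffling before splitting and the function name are assumptions for illustration):

```python
import numpy as np

def split_60_20_20(X, t, seed=0):
    """Shuffle the data, then split it into 60% train, 20% validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.6 * len(X))
    n_val = int(0.2 * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])
```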

12
Q

What is a solution for when data is limited?

A

n-fold cross validation
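A minimal n-fold cross-validation sketch; the model is assumed to be any object with fit and score methods (an illustrative interface, not a specific library API):

```python
import numpy as np

def cross_validate(model, X, t, n_folds=5, seed=0):
    """Average validation score of `model` over n folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model.fit(X[train_idx], t[train_idx])                # train on all folds but one
        scores.append(model.score(X[val_idx], t[val_idx]))   # validate on the held-out fold
    return np.mean(scores)
```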

13
Q

What is supervised learning?

A

Given examples of inputs and corresponding desired outputs, the system predicts outputs on future inputs.

14
Q

List 3 examples of supervised learning.

A

Classification, Regression, Time Series Prediction

15
Q

What is unsupervised learning?

A

Given only inputs, automatically discover representations, features, structure, etc.

16
Q

What is classification?

A

Outputs are categorical, inputs are anything. Goal is to select correct class for new inputs.

17
Q

What is Regression?

A

Outputs are continuous, inputs are anything (but usually continuous). Goal is to predict outputs accurately for new inputs.

18
Q

What is an example of unsupervised learning?

A

Clustering.

19
Q

What type of learning is KNN?

A

Supervised learning.

20
Q

In KNN what if k is too small?

A

Overfitting occurs: the learning algorithm corresponds too closely to a particular set of data and may therefore fail to fit additional data or predict future observations reliably (it overfits to noise).

21
Q

In KNN what if k is too big?

A

Underfitting occurs.
If k equals the total number of training data points, KNN does not learn much from the data; it simply becomes a majority classifier.

22
Q

What is leave-one-out validation?

A

When data is scarce, it may be appropriate to make the number of folds (in cross-validation) equal to the number of data points.

23
Q

What is the K-Nearest Neighbor (KNN) ‘algorithm’?

A

A generative nonparametric classification model

Training: store all data in some way
Given a test point x, find a sphere around x enclosing k points
Classify x according to the majority of the k neighbors
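A minimal KNN sketch matching the description above, using Euclidean distance and a majority vote (data and names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))   # Euclidean distances to all points
    nearest = np.argsort(dists)[:k]                       # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```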

24
Q

What is the nearest neighbor classifier?

A

KNN with k = 1

25
Q

How to solve overfitting? How to pick k?

A

We can use validation or cross validation to select model

26
Q

Advantages/disadvantages of KNN?

A

Advantages: learning is cheap (just need to remember all data points)

Disadvantages:
- Prediction is expensive (need to retrieve the k nearest neighbors from a large set of N points for each prediction made)
- In high dimensions, points are far away from each other (poor performance)

27
Q

Nonparametric classification methods

A

aka instance-based/case-based/memory-based methods. The number of model parameters grows with the number of training cases/data points
(example: KNN)

28
Q

Parametric classification methods

A

aka model-based methods

Number of parameters is fixed

29
Q

What are two basic ideas for parametric models?

A

  1. Linear classifiers
  2. Decision Trees

30
Q

Linear classifiers goal

A

To find the line (or hyperplane) which can “best” (under some criterion/ objective) separate two classes

31
Q

Goal of Perceptron?

A

Minimize an error function known as the perceptron criterion (which assigns zero error to any correctly classified data point).

(Seeks a weight vector w such that a pattern x_n in class C1 has w^T φ(x_n) > 0 and a pattern in C2 has w^T φ(x_n) < 0; the true class label t_n takes the value +1 for C1 and -1 for C2. When a data point is misclassified, its feature vector (times its label t_n) is added to the current weight vector, giving a new decision boundary.)
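A minimal perceptron training sketch consistent with this description, assuming a learning rate of 1 and an identity feature map φ(x) = x:

```python
import numpy as np

def train_perceptron(X, t, epochs=20):
    """X: (N, D) feature vectors, t: labels in {+1, -1}. Returns the weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:      # misclassified under the perceptron criterion
                w = w + t_n * x_n         # add (or subtract) the feature vector
    return w
```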

32
Q

Optimization algorithm, -?- is often applied to minimize the errors, which updates model parameters at each training data point.

A

stochastic gradient descent (SGD) (aka online algorithm)

33
Q

What does the perceptron convergence theorem state?

A

If there exists an exact solution (i.e., if the training data is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.

If the data is not linearly separable, the perceptron will not converge.

34
Q

What is the Decision Tree algorithm?

A

A parametric model

- a classification tree, where (1) each internal node represents a "test" on a variable/feature; (2) each branch represents an outcome of the test; and (3) each leaf node has a class label

35
Q

What is Decision Tree Training?

A

(Greedily) search over the features and choose the one whose split reduces impurity the most, where impurity is measured with one of the following (see the sketch after this list):

  • GINI
  • Entropy
  • Misclassification Errors
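A small sketch of the Gini and entropy impurity measures for a set of class labels (illustrative only):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_k p_k log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))  # 0.5 1.0
```
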
36
Q

Generative vs. Discriminative models

A
Generative: model the class-conditional densities p(x|Ck) and priors p(Ck), then use Bayes' theorem to find the posterior class probability p(Ck|x).
(By sampling from the joint distribution it is possible to generate synthetic/unseen data points in the input space.)

Discriminative: learn p(Ck|x) directly, or simply use a discriminant function to map each input x directly onto a class label, in which case probabilities play no role.

37
Q

Decision Theory

A

Used for both Generative/ Discriminative models to choose a class label.

Tells us how to make optimal decisions given the appropriate probabilities (minimize the chance of assigning x to the wrong class)

38
Q

What is leave-one-out validation?

A

When data is scarce, consider making the number of folds to be the number of data points

39
Q

What is an implementation issue of KNN?

A

Prediction is expensive. You need to retrieve k nearest neighbours from a large set of N points.

40
Q

What is the goal of a decision tree?

A

Minimize expected sum of impurity at leaves.

41
Q

Gaussian Classifier

A

A generative model (with continuous input)

Closed-form solution in which you obtain the parameters using maximum likelihood estimation.
Need to work with mean vectors and covariance matrices when extending from univariate to multivariate Gaussians.

42
Q

Naive Bayes Classifier

A

A generative model (with discrete input)

43
Q

Naive Bayes Classifier Assumption?

A

conditioned on class, features/ variables are independent

44
Q

How do we perform Naive Bayes on continuous input, e.g., in a Gaussian classifier?

A

Gaussian Bayes Classifier

Has a separate, diagonal covariance matrix for each class
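A minimal Gaussian Naive Bayes sketch with a separate diagonal covariance (per-feature variances) for each class; the function names and the small variance epsilon are illustrative assumptions:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class prior, mean vector, and per-feature variance (diagonal covariance)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, x):
    """Pick the class with the highest log joint probability log p(x, Ck)."""
    def log_joint(c):
        prior, mean, var = params[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=log_joint)
```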

45
Q

logistic regression models are:

A

Discriminative

- has the same number of adjustable parameters as the dimension of the feature space

46
Q

For a large value of M, (in an M-dimensional feature space) what is the best model to use?

A

Logistic regression (over Gaussian classifiers)

47
Q

What do we use to allow for direct calculations in an extended feature space?

A

use the “kernel trick”

48
Q

What is the key idea of Kernel methods?

A

reduce an algorithm to one which depends only on dot products between data vectors. Then replace the dot product with a kernel function k(x, z)
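A sketch of two common kernel functions that can stand in for the dot product (the polynomial constant and the RBF bandwidth gamma are illustrative choices):

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    """k(x, z) = (x^T z + c)^degree"""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2)"""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-gamma * np.sum((x - z) ** 2))
```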

49
Q

a kernel machine

A

Contains two modules: the algorithm and the kernel function.

Take a standard algorithm and massage it so that all references to the original data vectors x appear only in dot products x^T z.

50
Q

an N by N (N = # of training data points) symmetric matrix of all pairwise kernel evaluations

A

The “Gram Matrix”

Once an algorithm has been successfully "kernelized", you can build the Gram matrix and throw away the original data.
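A sketch of building the N x N Gram matrix from any kernel function, e.g. one of those in the sketch above (illustrative only):

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) for all pairs of training points (N x N, symmetric)."""
    N = len(X)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K
```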

51
Q

a support vector machine (SVM) is nothing more than a:

A

kernelized maximum-margin hyperplane classifier

52
Q

For linearly separable data, of all the hyperplanes that separate the data without misclassifying any data points, pick the one that maximizes the margin

A

maximum margin principle

53
Q

When multivariate Gaussian class densities produce a linear decision boundary, they share the...

A

same covariance matrix (Gaussian classifier)

54
Q

A supervised task with continuous output

A

regression

55
Q

Regression: what should we use for alpha?

A

Depends on your error function

  • For squared error, the mean is best
  • For absolute error, the median is best
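A quick numerical check of this (illustrative values only): the constant minimizing the squared error is the mean, while the constant minimizing the absolute error is the median.

```python
import numpy as np

t = np.array([1.0, 2.0, 10.0])                      # example targets with an outlier
grid = np.linspace(0.0, 12.0, 1201)                 # candidate constant predictions
sq = [np.sum((t - a) ** 2) for a in grid]
ab = [np.sum(np.abs(t - a)) for a in grid]
print(grid[np.argmin(sq)], t.mean())                # ~4.33, the mean
print(grid[np.argmin(ab)], np.median(t))            # 2.0, the median
```
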
56
Q

constant model

A

(Regression) Predicts y = t, a constant independent of x (the simplest baseline model for regression)

57
Q

Which of the following models is/are discriminative model(s)?

a. Naive Bayes
b. Logistic Regression
c. Perceptron
d. Gaussian Classifiers

A

Logistic regression, Perceptron

58
Q

We know that SVM learns parameters w and b so that:

arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }

(The notation here is the same as that used in the lecture.)
To convert the above optimization problem to an easier form, we need to enforce all support vectors to satisfy:
t_n (w^T φ(x_n) + b) = 1

True or False

A

True

59
Q

Which statement(s) about validation and cross-validation is/are true?

a. A validation set can be used to prevent overfitting
b. In some situations you should use validation but not cross-validation, e.g., when training takes too long (months)
c. To obtain reliable models, you should use the entire training data to train your model and then use the entire training data to perform validation

A

a. A validation set can be used to prevent overfitting

b. In some situations you should use validation but not cross-validation, e.g., when training takes too long (months)

60
Q

To run Naive Bayes on continuous features, ——–

a. You must transform the continuous features to categorical features
b. You must transform the features to the range [0, 1] using scaling
c. You just need to round the features to the nearest integers
d. Naive Bayes can work on continuous features directly

A

d. Naive Bayes can work on continuous features directly

61
Q

If you have trained a Logistic Regression model and obtained the following hyperplane:
w^T x = (0.3, 0.5, -1.5)(1, x1, x2)^T
During testing, this can be used to compute p(C1|x) with a logistic sigmoid function acting on it.
If you are given a data point x = (x1 = 5, x2 = 2), which of the following predictions is correct?
(For reference, the sigmoid function is:
sigmoid(a) = 1/(1 + exp(-a)))

a. p(c1 | x) > 0, so predict x to be c1
b. p(c1 | x) > 0.5, so predict x to be c1
c. p(c1 | x) < 0, so predict x to be c2
d. p(c1 | x) < 0.5, so predict x to be c2

A

d. p(c1 | x) < 0.5, so predict x to be c2
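The arithmetic behind answer (d), written out as a quick check (sketch only):

```python
import numpy as np

w = np.array([0.3, 0.5, -1.5])
x = np.array([1.0, 5.0, 2.0])            # (1, x1, x2) with x1 = 5, x2 = 2
a = w @ x                                # 0.3 + 2.5 - 3.0 = -0.2
p_c1 = 1.0 / (1.0 + np.exp(-a))          # sigmoid(-0.2) ~ 0.45 < 0.5
print(a, p_c1)                           # so predict class c2
```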

62
Q

Which one of the following statements is NOT true about support vector machines?

a. The kernel trick allows SVM to classify non linearly separable data
b. SVM tries to maximize the margin between two classes
c. During test phase, SVM needs to compute the dot product between a test data point and all training data points
d. SVM is a discriminative model

A

c. During test phase, SVM needs to compute the dot product between a test data point and all training data points