Test 1 W1-4 Flashcards

Tuesday, October 2

1
Q

What is Deep Learning?

A

A set of machine learning algorithms that model high-level abstractions in data by using model architectures (often neural networks)

2
Q

What is Machine Learning?

A

Let machines learn from data instead of writing programs by hand.

Provide examples that specify the correct output for a given input; a machine learning algorithm takes the data and produces a program/model that does the job (if done right, the program works for new cases as well, and you can change the program by training it on new data).

3
Q

Given examples of inputs and corresponding desired outputs, predict outputs on future inputs

A

Supervised Learning

4
Q

Given only inputs, automatically discover representations, features, structures, etc

A

Unsupervised Learning

5
Q

Given sequences of inputs, actions from a fixed set, and scalar rewards/ punishments, learn to select sequences in a way that maximizes expected reward

A

Reinforcement Learning

6
Q

Outputs are categorical; inputs are anything. Goal is to select correct class for new inputs.

A

Classification (Supervised Learning)

7
Q

Outputs are continuous; inputs are anything (but usually continuous values). Goal is to predict outputs accurately for new inputs.

A

Regression (Supervised Learning)

8
Q

-?- is used to scale the range of a feature

A

Feature Normalization

(need to normalize features so that the distance between two points is not governed by one particular feature with a broad range of values)
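A minimal sketch of min-max feature normalization in NumPy (the function name and example values are illustrative, not from the course):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Without scaling, the second feature (broad range) would dominate distance calculations.
X = np.array([[25.0, 40_000.0],
              [32.0, 120_000.0],
              [47.0, 60_000.0]])
print(min_max_normalize(X))
```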

9
Q

Many classifiers calculate the distance between two points by -?-

A

The Euclidean distance
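For reference, a small NumPy sketch of the Euclidean distance between two points (illustrative only):

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```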

10
Q

-?- is often used as the method to find the optimal solution, and it converges much faster with feature scaling than without it

A

gradient descent
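A minimal gradient descent sketch for intuition, assuming a linear model fit by minimizing mean squared error (the learning rate, step count, and toy data are illustrative):

```python
import numpy as np

def gradient_descent(X, t, lr=0.1, steps=100):
    """Fit weights w for predictions X @ w by minimizing mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - t) / len(t)  # gradient of the mean squared error
        w -= lr * grad
    return w

# Toy data: a bias column plus one feature; targets follow t = 1 + 2 * x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
print(gradient_descent(X, t))  # approaches [1, 2]
```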

11
Q

What are the three phases of classification tasks?

A

Data is split into:

  1. Training set
  2. Validation (aka development or held-out) set (used for model selection/averaging)
  3. Test set

Ratio often 60:20:20
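A sketch of a 60:20:20 split in plain NumPy (shuffling before splitting and the function name are assumptions for illustration):

```python
import numpy as np

def split_60_20_20(X, t, seed=0):
    """Shuffle the data, then split it into 60% train, 20% validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.6 * len(X))
    n_val = int(0.2 * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])
```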

12
Q

What is a solution for when data is limited?

A

n-fold cross validation
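A minimal n-fold cross-validation sketch; the model is assumed to be any object with fit and score methods (an illustrative interface, not a specific library API):

```python
import numpy as np

def cross_validate(model, X, t, n_folds=5, seed=0):
    """Average validation score of `model` over n folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model.fit(X[train_idx], t[train_idx])                # train on all folds but one
        scores.append(model.score(X[val_idx], t[val_idx]))   # validate on the held-out fold
    return np.mean(scores)
```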

13
Q

What is supervised learning?

A

Given examples of inputs and corresponding desired outputs, the system predicts outputs on future inputs.

14
Q

List 3 examples of supervised learning.

A

Classification, Regression, Time Series Prediction

15
Q

What is unsupervised learning?

A

Given only inputs, automatically discover representations, features, structure, etc.

16
Q

What is classification?

A

Outputs are categorical, inputs are anything. Goal is to select correct class for new inputs.

17
Q

What is Regression?

A

Outputs are continuous, inputs are anything (but usually continuous). Goal is to predict outputs accurately for new inputs.

18
Q

What is an example of unsupervised learning?

A

Clustering.

19
Q

What type of learning is KNN?

A

Supervised learning.

20
Q

In KNN what if k is too small?

A

Overfitting occurs: the learning algorithm corresponds too closely to a particular set of data and may therefore fail to fit additional data or predict future observations reliably (it overfits to noise).

21
Q

In KNN what if k is too big?

A

Underfitting occurs.
If k equals the total number of training data points, KNN does not learn much from the data; it simply becomes a majority classifier.

22
Q

What is leave-one-out validation?

A

When data is scarce, it may be appropriate to make the number of folds (in cross-validation) equal to the number of data points.

23
Q

What is the K-Nearest Neighbor (KNN) ‘algorithm’?

A

A generative nonparametric classification model

Training: store all data in some way
Given a test point x, find a sphere around x enclosing k points
Classify x according to the majority of the k neighbors
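A minimal KNN sketch matching the description above, using Euclidean distance and a majority vote (data and names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))   # Euclidean distances to all points
    nearest = np.argsort(dists)[:k]                       # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```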

24
Q

What is the nearest neighbor classifier?

A

KNN with k = 1

25
Q

How to solve overfitting? How to pick k?

A

We can use validation or cross validation to select model

26
Q

Advantages/disadvantages of KNN?

A

Advantages: learning is cheap (just need to remember all data points)

Disadvantages:
- Prediction is expensive (need to retrieve the k nearest neighbors from a large set of N points for each prediction made)
- In high dimensions, points are far away from each other (poor performance)

27
Q

Nonparametric classification methods

A

aka instance-based/case-based/memory-based methods. The number of model parameters grows with the number of training cases/data points
(example: KNN)

28
Q

Parametric classification methods

A

aka model-based methods

Number of parameters is fixed

29
Q

What are two basic ideas for parametric models?

A

  1. Linear classifiers
  2. Decision Trees

30
Q

Linear classifiers goal

A

To find the line (or hyperplane) which can “best” (under some criterion/ objective) separate two classes

31
Q

Goal of Perceptron?

A

Minimize an error function known as the perceptron criterion (which assigns zero error to any correctly classified data point).

(Seeks a weight vector w such that a pattern x_n in class C1 has w^T φ(x_n) > 0 and a pattern in C2 has w^T φ(x_n) < 0; the true class label t_n takes the value +1 for C1 and -1 for C2. When a data point is misclassified, its feature vector (times its label t_n) is added to the current weight vector, giving a new decision boundary.)
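A minimal perceptron training sketch consistent with this description, assuming a learning rate of 1 and an identity feature map φ(x) = x:

```python
import numpy as np

def train_perceptron(X, t, epochs=20):
    """X: (N, D) feature vectors, t: labels in {+1, -1}. Returns the weight vector w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:      # misclassified under the perceptron criterion
                w = w + t_n * x_n         # add (or subtract) the feature vector
    return w
```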

32
Q

Optimization algorithm, -?- is often applied to minimize the errors, which updates model parameters at each training data point.

A

stochastic gradient descent (SGD) (aka online algorithm)

33
Q

What does the perceptron convergence theorem state?

A

If there exists an exact solution (i.e., if the training data is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.

If the data is not linearly separable, the perceptron will not converge.

34
Q

What is the Decision Tree algorithm?

A

A parametric model

- a classification tree, where (1) each internal node represents a "test" on a variable/feature; (2) each branch represents an outcome of the test; and (3) each leaf node has a class label

35
Q

What is Decision Tree Training?

A

(Greedily) search over the features and choose the one whose split reduces impurity the most, where impurity is measured with one of the following (see the sketch after this list):

  • GINI
  • Entropy
  • Misclassification Errors
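A small sketch of the Gini and entropy impurity measures for a set of class labels (illustrative only):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_k p_k log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))  # 0.5 1.0
```
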
36
Q

Generative vs. Discriminative models

A
Generative: model the class-conditional densities p(x|Ck) and priors p(Ck), then use Bayes' theorem to find the posterior class probability p(Ck|x).
(By sampling from the joint distribution it is possible to generate synthetic/unseen data points in the input space.)

Discriminative: learn p(Ck|x) directly, or simply use a discriminant function to map each input x directly onto a class label, in which case probabilities play no role.

37
Q

Decision Theory

A

Used for both Generative/ Discriminative models to choose a class label.

Tells us how to make optimal decisions given the appropriate probabilities (minimize the chance of assigning x to the wrong class)

38
Q

What is leave-one-out validation?

A

When data is scarce, consider making the number of folds to be the number of data points

39
Q

What is an implementation issue of KNN?

A

Prediction is expensive. You need to retrieve k nearest neighbours from a large set of N points.

40
Q

What is the goal of a decision tree?

A

Minimize expected sum of impurity at leaves.

41
Q

Gaussian Classifier

A

A generative model (with continuous input)

Closed-form solution in which you obtain the parameters using maximum likelihood estimation.
Need to work with mean vectors and covariance matrices when extending from univariate to multivariate Gaussians.

42
Q

Naive Bayes Classifier

A

A generative model (with discrete input)

43
Q

Naive Bayes Classifier Assumption?

A

conditioned on class, features/ variables are independent

44
Q

How do we perform Naive Bayes on continuous input, e.g., in a Gaussian classifier?

A

Gaussian Bayes Classifier

Has a separate, diagonal covariance matrix for each class
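A minimal Gaussian Naive Bayes sketch with a separate diagonal covariance (per-feature variances) for each class; the function names and the small variance epsilon are illustrative assumptions:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class prior, mean vector, and per-feature variance (diagonal covariance)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, x):
    """Pick the class with the highest log joint probability log p(x, Ck)."""
    def log_joint(c):
        prior, mean, var = params[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=log_joint)
```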

45
Q

logistic regression models are:

A

Discriminative

- has the same number of adjustable parameters as the dimension of the feature space

46
Q

For a large value of M, (in an M-dimensional feature space) what is the best model to use?

A

Logistic regression (over Gaussian classifiers)

47
Q

What do we use to allow for direct calculations in an extended feature space?

A

use the “kernel trick”

48
Q

What is the key idea of Kernel methods?

A

reduce an algorithm to one which depends only on dot products between data vectors. Then replace the dot product with a kernel function k(x, z)
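A sketch of two common kernel functions that can stand in for the dot product (the polynomial constant and the RBF bandwidth gamma are illustrative choices):

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    """k(x, z) = (x^T z + c)^degree"""
    return (np.dot(x, z) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2)"""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return np.exp(-gamma * np.sum((x - z) ** 2))
```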

49
Q

a kernel machine

A

Contains two modules: the algorithm and the kernel function.

Take a standard algorithm and massage it so that all references to the original data vectors x appear only in dot products x^T z.

50
Q

an N by N (N = # of training data points) symmetric matrix of all pairwise kernel evaluations

A

The “Gram Matrix”

Once an algorithm has been successfully "kernelized", you can build the Gram matrix and throw away the original data.
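A sketch of building the N x N Gram matrix from any kernel function, e.g. one of those in the sketch above (illustrative only):

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) for all pairs of training points (N x N, symmetric)."""
    N = len(X)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K
```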

51
Q

a support vector machine (SVM) is nothing more than a:

A

kernelized maximum-margin hyperplane classifier

52
Q

For linearly separable data, of all the hyperplanes that separate the data without misclassifying any data points, pick the one that maximizes the margin

A

maximum margin principle

53
Q

When multivariate Gaussian class densities produce a linear decision boundary, they share the...

A

same covariance matrix (Gaussian classifier)

54
Q

A supervised task with continuous output

A

regression

55
Q

Regression: what should we use for alpha?

A

Depends on your error function

  • For squared error, the mean is best
  • For absolute error, the median is best
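A quick numerical check of this (illustrative values only): the constant minimizing the squared error is the mean, while the constant minimizing the absolute error is the median.

```python
import numpy as np

t = np.array([1.0, 2.0, 10.0])                      # example targets with an outlier
grid = np.linspace(0.0, 12.0, 1201)                 # candidate constant predictions
sq = [np.sum((t - a) ** 2) for a in grid]
ab = [np.sum(np.abs(t - a)) for a in grid]
print(grid[np.argmin(sq)], t.mean())                # ~4.33, the mean
print(grid[np.argmin(ab)], np.median(t))            # 2.0, the median
```
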
56
Q

constant model

A

(Regression) Predicts y = t, a constant independent of x (the simplest baseline model for regression)

57
Q

Which of the following models is/are discriminative model(s)?

a. Naive Bayes
b. Logistic Regression
c. Perceptron
d. Gaussian Classifiers

A

Logistic regression, Perceptron

58
Q

We know that SVM learns parameters w and b so that:

arg max_{w,b} { (1/||w||) min_n [ t_n (w^T φ(x_n) + b) ] }

(The notation here is the same as that used in the lecture.)
To convert the above optimization problem to an easier form, we need to enforce all support vectors to satisfy:
t_n (w^T φ(x_n) + b) = 1

True or False

A

True

59
Q

Which statement(s) about validation and cross-validation is/are true?

a. A validation set can be used to prevent overfitting
b. In some situations you should use validation but not cross-validation, e.g., when training takes too long (months)
c. To obtain reliable models, you should use the entire training data to train your model and then use the entire training data to perform validation

A

a. A validation set can be used to prevent overfitting

b. In some situations you should use validation but not cross-validation, e.g., when training takes too long (months)

60
Q

To run Naive Bayes on continuous features, ——–

a. You must transform the continuous features to categorical features
b. You must transform the features to the range [0, 1] using scaling
c. You just need to round the features to the nearest integers
d. Naive Bayes can work on continuous features directly

A

d. Naive Bayes can work on continuous features directly

61
Q

If you have trained a Logistic Regression model and obtained the following hyperplane:
w^T x = (0.3, 0.5, -1.5)(1, x1, x2)^T
During testing, this can be used to compute p(C1|x) with a logistic sigmoid function acting on it.
If you are given a data point x = (x1 = 5, x2 = 2), which of the following predictions is correct?
(For reference, the sigmoid function is:
sigmoid(a) = 1/(1 + exp(-a)))

a. p(c1 | x) > 0, so predict x to be c1
b. p(c1 | x) > 0.5, so predict x to be c1
c. p(c1 | x) < 0, so predict x to be c2
d. p(c1 | x) < 0.5, so predict x to be c2

A

d. p(c1 | x) < 0.5, so predict x to be c2
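The arithmetic behind answer (d), written out as a quick check (sketch only):

```python
import numpy as np

w = np.array([0.3, 0.5, -1.5])
x = np.array([1.0, 5.0, 2.0])            # (1, x1, x2) with x1 = 5, x2 = 2
a = w @ x                                # 0.3 + 2.5 - 3.0 = -0.2
p_c1 = 1.0 / (1.0 + np.exp(-a))          # sigmoid(-0.2) ~ 0.45 < 0.5
print(a, p_c1)                           # so predict class c2
```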

62
Q

Which one of the following statements is NOT true about support vector machines?

a. The kernel trick allows SVM to classify non linearly separable data
b. SVM tries to maximize the margin between two classes
c. During test phase, SVM needs to compute the dot product between a test data point and all training data points
d. SVM is a discriminative model

A

c. During test phase, SVM needs to compute the dot product between a test data point and all training data points