Test 2 W5-7 Flashcards

Tuesday, October 30

1
Q

Regression Models: What happens when M is too large or when M is too small?

A

M is the polynomial order.
When M is too small (e.g., M = 0), underfitting occurs.
When M is too large (e.g., M = 9), overfitting occurs.

2
Q

What happens as the polynomial order increases?

A

As the order of the polynomial increases, the model becomes more complex: the training error keeps decreasing, but the test error will eventually increase.

3
Q

How can you mitigate overfitting?

A

Use more training data, or constrain the magnitude of the weights (a typical way to prevent overfitting, since it controls model complexity). If you start training with small weight values, "stopping (training) earlier" corresponds to constraining the weights to be small, which can control overfitting (SVMs naturally do this to some degree).
Constraining the weights is done via regularization, a technique used to control the size of the weights and hence model complexity.

4
Q

When there is not enough training data, what happens to the magnitude of the weights w as M increases?

A

The weights w typically get larger in magnitude.

5
Q

What is regularization?

A

Regularization is used to control overfitting. It adds a penalty term to the error function in order to discourage the coefficients from reaching large values.
It aims to balance minimizing the unregularized error against finding "small" weights (those close to the origin).
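As a concrete form (a standard ridge-style penalty in common textbook notation; the exact error function is not shown on this card), the regularized sum-of-squares error is

    \tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\lVert \mathbf{w} \rVert^2

where λ controls the trade-off between fitting the data and keeping the weights small.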

6
Q

What is the result of the following regularization values?
ln λ = -18
ln λ = 0
ln λ = -∞

A

ln λ = -18: overfitting has been suppressed, and we now obtain a much closer representation of the underlying function
ln λ = 0: we used too large a value for λ and therefore obtain a poor fit
ln λ = -∞: corresponds to a model with no regularization

7
Q

What is the result of over-regularization, and how do you prevent it?

A

Underfitting. Therefore, you should use (cross-)validation to select models, i.e., to pick a good regularization coefficient.

8
Q

The case q = 1 is known as the lasso model. What happens as a result if λ is sufficiently large?

A

The lasso may give us small/sparse models, in which some of the weights are exactly zero.
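For context, a standard general form of the penalty (assumed here, not shown on the card) is

    \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q

where q = 2 gives the quadratic (ridge) regularizer and q = 1 gives the lasso; the lasso penalty's corner at zero is what drives some weights exactly to zero when λ is large enough.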

9
Q

How do you learn the parameters of a linear regression model, given a training data set?

A
  • For squared errors, minimize the error function with respect to the parameters w.
  • Set the gradient to zero with respect to w.
  • Then solve for w in closed form using the design matrix Φ (or the raw feature matrix, if you did not use basis functions to process your features); see the normal equations below.
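For the sum-of-squares error this yields the normal equations (a standard result; Φ denotes the design matrix whose n-th row holds the basis-function values for x_n):

    \mathbf{w}_{\mathrm{ML}} = (\Phi^{\top}\Phi)^{-1}\Phi^{\top}\mathbf{t}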
10
Q

How can we view linear regression, and what is the probabilistic view?

A

Linear regression: finding a curve to minimize a predefined error (e.g., squared error or absolute error).
Probabilistic view: think of regression as finding the curve that generates t, under some noise distribution (e.g., Gaussian), with the highest probability.

These two views are the same. The equivalence can be seen if we use maximum likelihood to estimate w in the probabilistic viewpoint.
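Concretely, under the standard Gaussian-noise assumption (with noise precision β, a symbol this card does not name), the probabilistic view writes

    p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)

and maximizing the log-likelihood over w discards the terms independent of w, leaving exactly the negative sum-of-squares error.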

11
Q

What is unsupervised learning?

A

Machine learning with no outputs, only inputs

12
Q

Def: Clustering

and what do all clustering methods require?

A

grouping similar training cases together and identifying a “prototype” example to represent each group (e.g., the mean of a cluster)

All methods require a way to measure distance between two data points, e.g., the Euclidean distance.

13
Q

High level: what is K-means (the algorithm)?

A
  • For a pre-defined number of clusters K, first randomly initialize a center for each cluster.
  • Then alternate between 2 steps (a NumPy sketch follows this list):
    1. E step: for all data points, assign each to the cluster whose center is closest to that data point
    2. M step: update each cluster's center to be the mean of all points assigned to that cluster
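A minimal NumPy sketch of this alternation (function name and defaults are illustrative, not from the course):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Plain k-means: alternate E (assign) and M (update) steps."""
        rng = np.random.default_rng(seed)
        # Initialize centers by picking k distinct data points at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # E step: assign each point to its nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # M step: move each center to the mean of its assigned points
            # (keep the old center if a cluster ends up empty).
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):  # converged
                break
            centers = new_centers
        return centers, labels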
14
Q

The goal of k-means?

A

Find an assignment of data points to clusters, as well as a set of centers {μ_k} (one center per cluster), such that the sum of the squared Euclidean distances from each data point to its closest center μ_k is minimized, where N is the number of data points. K-means minimizes the error function below by finding {μ_k} and {r_nk}.
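In standard notation, with r_nk ∈ {0, 1} indicating whether point x_n is assigned to cluster k, that error function (the distortion measure) is

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2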

15
Q

In K-means, how do we optimize {r_nk} (with {μ_k} fixed)?

A

assign each point to the cluster whose center is closest to this point

16
Q

In k-means, how do we optimize {μ_k} with {r_nk} fixed?

A

Set the derivative of J (with respect to μ_k) to zero. This sets μ_k to the mean of all points assigned to cluster k.
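Carrying out that standard derivation (not shown on the card):

    \frac{\partial J}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk}(x_n - \mu_k) = 0
    \;\Rightarrow\;
    \mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}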

17
Q

Def: k-medians

A

Instead of using the squared Euclidean distance in the error function, if we use the absolute (L1) distance, we get k-medians (the error function is sketched below).
The M step is still easy: just take the median of all points assigned to cluster k.
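Presumably the intended error function (it is not reproduced on the card) swaps the squared Euclidean distance for an L1 distance:

    \tilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert_1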

18
Q

Def: K-medoids

A

We update each cluster's center to be one of the data points assigned to that cluster.

19
Q

What is the issue with k-medoids?

A

Have to try every possible point. Expensive!

20
Q

Describe the following tricks used in k-means:

  • initialization
  • Ties in distance
  • Pick # of clusters and error functions
  • speed
  • robust errors
  • avoid local minima
A
  • Initialization: set all centers randomly (a farthest-first refinement is sketched after this list)
  • Ties in distance: add points to small clusters first
  • Pick # of clusters and error functions: e.g., cross-validation (the cluster number is selected such that any further increase leads to only a small reduction in the error function)
  • Speed: use data structures (e.g., trees) to speed up the search
  • Robust errors: for k-means, use the squared error up to some maximum error and a constant error beyond it, or use other distances
  • Avoid local minima: use random restarts; split and merge clusters
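A minimal NumPy sketch of the farthest-first initialization that card 49 refers to (assuming Euclidean distance; names are illustrative):

    import numpy as np

    def farthest_first_centers(X, k, seed=0):
        """Pick the first center at random, then repeatedly pick the
        point farthest from all centers chosen so far."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Squared distance from every point to its nearest chosen center.
            d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2)
            centers.append(X[d2.min(axis=1).argmax()])
        return np.array(centers)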
21
Q

What is hierarchical clustering, and what are the two types?

A

Break the dataset into a series of nested clusters organized in a tree

  1. Agglomerative algorithms
  2. Divisive algorithms
22
Q

What is the agglomerative algorithm?

A

A hierarchical clustering algorithm.

  • Start with each data point in its own cluster, then successively merge similar clusters until a single cluster remains.
  • There are several methods for merging; most are based on computing cluster distances from the pairwise distances between all pairs of points, and then merging the two clusters with the smallest cluster distance.
23
Q

What is the divisive algorithm?

A

A hierarchical clustering algorithm

  • Start with all the data in a single cluster and successively split clusters.

24
Q

What is the 1-of-K coding scheme?

A

Used by, e.g., k-means.

Assigns a data point to exactly one of K clusters.

25
Q

What is the basic idea of a mixture of Gaussians?

A

Allow all data points to belong to multiple clusters with some probability (i.e., soft assignment) during the process of clustering.

26
Q

What does MOG assume?

A

MOG = mixture of Gaussians

Assumes each cluster k has a Gaussian distribution with its own mean μ_k and covariance matrix Σ_k.

27
Q

What are the steps in learning a MOG (mixture of Gaussians)?

A

E step: instead of a "hard assignment", we do a "soft assignment", giving each data point some probability of belonging to each cluster.

M step: each center is moved to the weighted mean of the data, with weights given by the soft assignments.
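In standard mixture-of-Gaussians notation (assumed here; the mixing coefficients π_k are not named on this card), the E-step responsibilities and the M-step mean update are

    \gamma(z_{nk}) = \frac{\pi_k \,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)},
    \qquad
    \mu_k^{\text{new}} = \frac{\sum_{n} \gamma(z_{nk})\, x_n}{\sum_{n} \gamma(z_{nk})}

with analogous weighted updates for the covariances Σ_k and mixing coefficients π_k.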

28
Q

What are latent variables? and why might we want to introduce them?

A

Variables that are always unobserved (e.g., the assignment probabilities in MOG).

We might intentionally introduce them to model complex dependencies between observed variables without modeling those dependencies directly.

29
Q

What do we use in situations when we have some unobserved variables?

A

The EM algorithm (Expectation-Maximization algorithm)

30
Q

What is the EM algorithm?

A

The iterative E-step and M-step algorithm:

E-step: estimate the "expectation" of the "missing" or "unobserved" variables from the observed data and the current parameters θ. (In MOG, we compute the "soft assignment" weights/probabilities.)
M-step: using the "completely" observed data, find the maximum-likelihood estimates for the parameters θ. (In MOG, based on the soft assignment of points to different clusters, we use the regular maximum-likelihood solutions to estimate the means μ_k and covariances Σ_k.)

31
Q

What is the difference between EM and Mixture of Gaussians?

A
  • EM is an optimization strategy for objective functions when we have some unobserved variables (more specifically, when the models are conditioned on these unobserved variables)
  • EM is not a model like MOG or k-means; it is used within such models to help optimize them
32
Q

When should you resort to the EM algorithm?

A

When you have some missing input, labels, or latent variables in your models

33
Q

What does the EM optimization algorithm used in k-means guarantee?

A

It guarantees the model's convergence to an extremum but does not guarantee finding the global extremum. (Often, one uses different initializations, runs the algorithm multiple times, and picks the solution with minimal error to try to avoid poor local extrema.)

34
Q

What is Dimensionality reduction?

A

Another type of unsupervised learning problem -> compress data to fewer dimensions while maintaining the information in the data.

35
Q

What is Principal Component Analysis (PCA)?

A

A typical dimensionality-reduction model (also used in data visualization, feature extraction, pattern learning, etc.).

Can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized.
Can also be defined as the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections.

36
Q

When you order your eigenvectors by eigenvalue, from highest to lowest, what are V1 and V2 considered?

how do you reduce dimensionality?

A

V1 - the first principal component
V2 - the second principal component

In general you get n components. To reduce the dimensionality to p, ignore the n − p components at the bottom of the list.

37
Q

What is the goal of PCA? Consider a set of data points {x_n}, where n = 1,…,N and each x_n has dimensionality D.

A

The goal is to project the data onto a space having dimensionality M, equal to or smaller than D, while maximizing the variance of the projected data.

38
Q

What is the goal of PCA?

A

Maximize the projected variance by finding a proper u_1 (a unit vector; the unit-length requirement acts as a constraint).
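The standard derivation (with S denoting the data covariance matrix, a symbol not used elsewhere in this deck): maximize the projected variance subject to the unit-length constraint; a Lagrange multiplier turns this into an eigenvector equation,

    \max_{u_1} \; u_1^{\top} S u_1 \quad \text{s.t.} \quad u_1^{\top} u_1 = 1
    \;\Rightarrow\;
    S u_1 = \lambda_1 u_1

so u_1 is the eigenvector of S with the largest eigenvalue λ_1, and the projected variance equals λ_1.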

39
Q

In PCA, how can we define additional principal components?

A

can do so in an incremental fashion by choosing each new direction to be that which maximizes the projected variance amongst all possible directions orthogonal to those already considered.

40
Q

Once we have done PCA, how do we reconstruct the original data?

A

RowDataAdjust = RowFeatureVector^T × TransformedData

RowDataOriginal = RowDataAdjust + OriginalMean
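A NumPy sketch of this round trip (the data values are made up for illustration; variable names mirror the card's RowFeatureVector / TransformedData / OriginalMean terminology):

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]])  # toy data
    mean = X.mean(axis=0)
    Xc = X - mean                          # mean-centered data (rows = points)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
    order = np.argsort(eigvals)[::-1]      # sort components high -> low
    W = eigvecs[:, order].T                # RowFeatureVector (rows = components)
    T = W @ Xc.T                           # TransformedData
    X_rec = (W.T @ T).T + mean             # RowDataOriginal
    assert np.allclose(X_rec, X)           # lossless when all components are kept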

41
Q

What happens if you perform PCA data reconstruction using the complete feature vector that includes all eigenvectors (1), vs. if you leave out some eigenvectors when applying PCA dimensionality reduction (2)?

A

(1) You can reconstruct exactly the original data you started with.
(2) You can reconstruct the original data, but with some information loss.

42
Q

How do you generate continuous latent variable data?

A

When generating data, first generate a point within the low-dimensional subspace, then add noise. The coordinates within the subspace are the components of the latent variables.

43
Q

What are factor models? More specifically, what is factor analysis?

A

Many dimensionality-reduction models like PCA aim to find the subspace specified by lower-dimensional coordinates. These models are called "factor models" (including factor analysis, principal component analysis, independent component analysis, etc.).

Specifically, when we assume that the subspace is linear and that the underlying latent variable has a Gaussian distribution, we get the model known as factor analysis.

44
Q

In general, how can you think of clustering and dimensionality reduction?

A

Clustering can be thought of as classification with missing class labels.

Dimensionality reduction can be thought of as regression with missing targets.

45
Q

The mixture of Gaussians assumes that each cluster has its own mean but all clusters share the same covariance matrix.
True or False?

A

False

46
Q

When training a constant regression model y = a, we need to use the training data to estimate a.
Given a training set {(x_n, t_n)}, where x_n is the feature vector of a data point and t_n is the output at that data point.
If the error function that the model aims to minimize is the absolute error E = Σ_n |t_n − a|, the model should use the median of the t_n values as the estimate for a.

True or False?

A

True
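The standard argument (not spelled out on the card): wherever a is not equal to any t_n,

    \frac{d}{da} \sum_n |t_n - a| = \#\{n : t_n < a\} - \#\{n : t_n > a\}

which is zero exactly when a has equally many points on each side, i.e., at the median. (Minimizing the squared error instead gives the mean.)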

47
Q

In regression, when we increase the order of polynomials, the training error will decrease (until the error is zero); the test error will decrease first, and then when overfitting happens the test error will start to increase.
True or False?

A

True

48
Q

The ———- clustering algorithms start with one cluster and then split it into smaller clusters, while the ———- clustering algorithms start with one data point per cluster and then successively group them until reaching a single cluster that contains all data points.

A

divisive, agglomerative

49
Q

In clustering, the farthest-first strategy states that:

a. during initialization, have cluster centers as far as possible from one another
b. convergence of clustering only happens when the initial centers are as far as possible from one another
c. Data points should be moved so that clusters are as far as possible from one another
d. In the E step, data points are assigned to the farthest cluster center

A

a. during initialization, have cluster centers as far as possible from one another

50
Q

PCA can be used to create lossless transformation where data is transformed to a new space that has the same dimensionality as the original space. In such a situation, the original data can be fully reconstructed from the transformed data without any information loss.

True or False?

A

True

For a lossless transformation, the number of principal components must equal the number of attributes in the original higher-dimensional space, hence there is no compression.

51
Q

In PCA, we get the following eigenvalues and eigenvectors for the covariance matrix of the training data:

eigenvalue-1 = 1.0 and eigenvector-1 = (−0.5, 0.866)
eigenvalue-2 = 2.5 and eigenvector-2 = (0.866, 0.5)

What is the first principal component?

A

(0.866, 0.5), the eigenvector corresponding to the largest eigenvalue (2.5)

52
Q

Which of the following statement(s) about regularization is/are true?

a. One way to do regularization is to add a penalty term to the error function in order to discourage the weights w from reaching large values
b. A lasso regularizer often gives us small/sparse models, since it often sets weights to zero
c. Non-linear basis functions can extend a linear model to do non-linear regression

A

a. One way to do regularization is to add a penalty term to the error function in order to discourage the weights w from reaching large values
b. A lasso regularizer often gives us small/sparse models, since it often sets weights to zero
c. Non-linear basis functions can extend a linear model to do non-linear regression

53
Q

Which of the following statement(s) about the EM algorithm is/are true?

a. The E step finds the maximum likelihood estimates for parameters
b. The EM algorithm is an optimization method used in situations where we have some unobserved variables
c. EM guarantees to converge and find the globally optimal solution for a problem

A

b. The EM algorithm is an optimization method used in situations where we have some unobserved variables

54
Q

Which of the following statement(s) about clustering is/are true? (Select all correct answers.)

a. Clustering models are unsupervised models.
b. Clustering algorithms group similar training cases together and identify a "prototype" example to represent each group.
c. K-means always converges.
d. K-means always finds the globally optimal solution(s).

A

a. Clustering models are unsupervised models.
b. Clustering algorithms group similar training cases together and identify a "prototype" example to represent each group.
c. K-means always converges.

55
Q

In mixture of Gaussians, _________

a. In the M step, all data points (which have been softly assigned to clusters in the E step) are used to update the means and covariances of all clusters.
b. The initialization step always gives every cluster a different covariance matrix.
c. Only data points belonging to a cluster are used to update the mean and covariances of that cluster.

A

a. In the M step, all data points (which have been softly assigned to clusters in the E step) are used to update the means and covariances of all clusters.

56
Q

We can view regression from two perspectives: (1) finding a curve to minimize a predefined error function, or (2) finding a curve that generates observed output t with some noise distribution (e.g., Gaussian) with the maximum likelihood.

True or False?

A

True

57
Q

PCA can be defined as the orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data is minimized.

True or False

A

False

58
Q

K-medoids updates the centre of a cluster to be the median of all the data points assigned to that cluster.

True or False

A

False

59
Q

Which of the following statement(s) is/are true for regression? (Select all correct answers.)

a. Compared with lower-order polynomials, one should always use higher-order polynomial regression models, since they are more complex and powerful.
b. One should always choose regression models that make the smallest errors on training data.
c. The situation of overfitting in regression can be less severe if we have more training data.

A

c. The situation of overfitting in regression can be less severe if we have more training data.