W8-12 Flashcards

New Material Included on the Exam

1
Q

Def: First-order Markov models vs higher-order (i.e. second-order) Markov models

A

A first-order Markov chain of observations {x(subscript t)} in which the distribution of particular observation {x(subscript t)} s conditioned on the value of the previous observation {x(subscript t - 1)}

A second-order Markov chain, in which the conditional distribution of a particular observation {x(subscript t)} depends on the values of the two previous observations {x(subscript t - 1)} and {x(subscript t - 2)}

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the process of learning Markov models?

A

You just need to count and calculate the transition probabilities based on training data you have

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Markov Models: Consider if you apply 2nd order Markov models to predict the next English word, there are many three-word sequences you may not have observed training data but only in the test phase
During training you may need to save some probabilities for unseen cases, which is called ______.

A

smoothing

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are the following Markov model parameters?
Start/ Initial Probability
State transition probability
Emission probability

A

Start/ Initial (self explanitory)
State transition - the probability of switching from one hidden state to another
Emission - The probability of observing x of a certain state (i.e. for dice room probability of rolling a 1 fair vs loaded)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Why is the key assumption of Hidden Markov Models?

A

The key independence property is that the hidden variables {z(subscript t - 1)} and {z(subscript t + 1)} are conditional independent given {z(subscript t)}

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

In hidden Markov Models, what happens if hidden states are not given?

A

Unlike in (non-hidden_ Markov models, observation x(subscript t) depends on all previous x(subscript i) (i

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the three main problems of HMMs (unsupervised)?

A
  1. Decoding: GIVEN an HMM specified by θ, and a sequence x, FIND the sequence z that maximizes p(z|x,θ),
    i. e., z* = argmax(sub z) p(z|x,θ)
  2. Evaluation : GIVEN an HMM specified by θ, and a sequence x, FIND p(x|θ) = Sum(sub z) p(x, z|θ)
  3. Learning: GIVEN a set of observed sequences X, FIND parameters θ that maximize p(X|θ),
    i. e. θ* = argmax(subθ) p(X|θ)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

HMM: What is the naive method of finding z* and what is the problem with this method?

A

The naïve method of finding z* is to compute all p(x,z|θ) for all values that z can takes
Problem: Exponential numbers of z

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the main idea of Decoding called? and what kind of programming is this?

A

“Bugs on a Grid”

Dynamic programming

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are the “Bugs on a Grid” steps for decoding?

A
  1. put bugs in each state at t=1, holding values of initial probabilities.
  2. move each bug forward to time t+1 by making copies of each of them to all K node at t+1 and incrementing the value of each copy by the probability of the transition and output emission.
  3. at each node of time t+1, only keep one copy of K incoming bugs which carries the largest probability.
  4. go to 2 until bugs have reached time n
  5. find the bug carrying the largest probability at time n, get the path that winning bug has passed
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is the time complexity and space (big O) of the Viterbi Algorithm?

A
Time complexity: O((K^2)n)
Space O(Kn)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Underflows are a significant problem of the Viterbi algorithm. What is the solution?

A

p(x1,…,xn,z1,…,zn) - these numbers become extremely small - underflow
Solution: Take the logs of all values
Vk’(t+1) = log (ek’, xt+1) + maxk(Vk(t) + log ak,k’)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the steps that are different when doing the “Bugs on a Grid” algorithm for Evaluation? What is another name for this algorithm?

A

Steps 3 and 5 change:

  1. put bugs in each state at t=1, holding values of initial probabilities.
  2. move each bug forward to time t+1 by making copies of each of them to all K node at t+1 and incrementing the value of each copy by the probability of the transition and output emission.
  3. at each node of time t+1, replace all the bugs with a single bug carrying the sum of their values
  4. go to 2 until all bugs have reached time n
  5. sum up values on all bugs.

AKA “forward algorithm”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is the running time, and space required, for Forward and Backward algorithms?

A
Time complexity: O((K^2)n)
Space O(Kn)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the posterior decoding calculation calculate?

A

What is the mst likely state at tiem t given sequence x?
argmaxkP(zt= k | x)

(forward algorithm and backwards algorithm allow us to calculate)

p(zt= k|x) = fk(t) bk(t)/ p(x)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Problem 3 of HMM Learning requires that parameters θ that maximize p(X|θ) are found. How can we compute the distribution of hidden states over paths if an HMM’s parameters are known?

A

Use the EM algorithm

  • initialize HMM parameters
  • E STEP: Compute useful distributions about hidden states, which will be used to estimate model parameters in the M step
  • M step: Compute max likelihood estimates for model parameters, as hidden states distributions are known
  • repeat E step and M step until convergence criterion is satisfied (e.g., when likelihood p(X|θ) does not increase too much
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How do we do the EM algorithm efficiently?

A

Use the “Bugs of a grid” dynamic programming trick
- Initialize HMM parameters θ =(pi,A,E)
- given a training sequence x = {x1, x2, …, xn}
E STEP: Compute useful distribution about hidden states that will be used to estimate model parameters later in the M step
1. compute p(zt= k|x) for all k = 1,…,K (calculate using forward-backward algorithm)
2. Compute p(zt=k,zt+1=k’|x) = fk(t) ak,k’ek’,xt+1bk(t) / p(x)
M STEP: Compute max likelihood estimates for model parameters, given hidden state distribution is known
1. Estimate initial probability
2. Estimate Transition probability
3. Estimate Emission Probability

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Define: Deep Learning

A

A set of machine learning algorithms that model high-level abstractions in data using artificial neural networks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Why deep learning now?

A

Computers finally have computation power

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Neuro net algorithms are inspired by ___

A

the brain

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What does a Linear Neuron calculate?

A

A linear neuron just computes a weighted sum of input features

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What is a multi layer of linear neurons?

A

Still a linear model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What does a Binary Threshold Neron do?

A
  • First compute weighted sum of the inputs

- then send out a fixed size spike of activity if the weighted sum exceeds a threshold

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What does a Rectified Linear Neuron calculate?

A

They compute a weighted sum of their inputs. Then it is passed to a rectified linear unit. (sometimes called linear threshold neurons)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

We have shown that the ____________ is relevant to “intellegence” of systems or living beings.

What is the other factor that is important?

A

Number of neurons

structure (even if computer has same number as a human, lacks the common sense, can’t reason)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What is a Feed Forward Neural Network?

A

Most common type of neural network for practical applications
- first layer input
- last layer output
Compute a series of transformations

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

If there is more than one layer of a Feed Fforward neural Network, we call them ____

A

“deep” neural networks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What are the activities of the neurons in each layer of a feed forward neural network?

A

a non-linear function of the activities in the layer below

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Feed Forward networks are fully connected

True or False?

A

True

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

It has been proven that networks with one hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy
True or False?

A

True

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

What are Large, Shallow models more prone to?

A

Overfit

convolutional networks (sparsely connected feed-forward) are shallow models - overfit around 20 million parameters while deep ones can benefit from having over 60 million and for the same number of parameters, deep models achieve better performance than shallow ones

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Why do we want a deep network rather than that with only one layer?

A

Shallow networks may overfit more

Deep net often achieve better performance with same number of parameters

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

How deep do we want our neural nets?

A

task dependent (no universal answer)

For some problems, the depth of models are naturally given by the problems (i.e., in recurrent neural nets a network ofen has a depth “by time” (number of steps))

34
Q

What does the Universal Aproximator Theorem state?

A

1 hidden layer is enough to represnt an approximation of any function to any arbitrary degree of accuracy

35
Q

How many layers classifies a shallow model?

A

3 layers or less

36
Q

What is the problem with this method of training Feed Forward Nets? :
Randomly perturb one weight and see if it improves performance. If so, save the change.

A

Very inefficient. Need to do multiple forward passes on a representative set of training data just to change one weight.

37
Q

What is the problem with this method of training Feed Forward Nets? :
Randomly perturb all the weights in parallel and correlate the performance gain with the weight changes

A

It is still hard to figure out on which weight(s) the change improves performance

38
Q

When training feed forward neral nets we don’t know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity. What is the key idea to doing this?

A

Minimize an error function by computing the derivative w.r.t. model parameters and set that to be zero ‘
(FF neral nets, Derivatives are computed with an algorithm called: BACKWARD PROPOGATION AKA BACKPROP)
Once you get derivatives, there are many optimization algorithms you can chose to use to find good parameters

39
Q

How do you train Feed-Norward networks using forward and backward propogation? (4 steps)

A
  1. Forward propogation. Compute hidden units and output
  2. Compute loss/errors
  3. Back-propogation. Compute derivatives
  4. Use some existing optimization algorithm to find good parameters (those minimizing errors)
40
Q

In nerual networks optimization algorithms what is not guaranteed?

A

Do not guarantee find globally optimal solution (often, the error function is not convex and may have many local optimums)

41
Q

How do you check the correctness of your backprop implementation?

A
  • Suppose we have implemted backprop to compute dE/dθ
  • If the backprop is implemented correctly we should have
    dE/dθ ~= (E(θ+e)-E(θ-e))/2e

In neural network, θ is a vector that includes the weights w of all layers (including biases), The outline of the checking algorithm is:

  • we randomly pick a θ, and then compute E(θ+e) and E(θ-e) with forward propagation
  • we also compute E(θ) with forward propogation and based on that compute dE/dθ with backprop
  • check if the equation above is satisfied or not
42
Q

What are the two types of memoryless models for sequences?

A
  • Autoregressive models: Predict the next term in a sequence from a fixed number of previous terms (i.e. Markov Models)

Feed-Forward neural nets: generalize autoregressive models by using one or more layers of non-linear hidden units

(both cases: no memory about longer history)

43
Q

When hidden Markov model generates data. Ad each time step it must select one of its hidden states. So with k hidden states it can only rememer ——– bits about what it generated so far

A

log(k)

44
Q

What does the information that the first half of a spoken sentence of an HMM contains about the second half?

A
  • The syntax needs to fit (e.g. number and tense agreement)
  • The semantics needs to fit. The intonation needs to fit.
  • The accent, rate, volume, and vocal tract characteristics must all fit

All these aspects combined could be 100 bits of information that the first half of an utterance needs to convey to the second half. 2^100 is big!!!

45
Q

What two properties makes Recurrent Neural networks more powerful?

A
  • Distributed hidden state that allows them to store a lot of information about the past efficiently
  • non-linear dynamics that allows them to update their hidden state in complicated ways
46
Q

RNNs: all time steps share the same — and — functions

A

transition and emission functions

47
Q

How can we think of the recurrent net?

A

as a layered, feed-forward net with shared weights and then train the feed-forward net with weight constraints

48
Q

How can we think of the backprop of a RNN training algorithm in the time domain?

A
  • the forward pass builds up a stack of the activities of all the units at each time step
  • the backward pass peels activities off the stack to compute the error derivatives at each time step
  • after the backward pass we add together the derivatives at all the different times for each weight
49
Q

What are the RNN learning restrictions?

A

The automaton is restricted to be exactly one state at each time. The hidden units are restricted to have exactly one vector of activity at each time.

50
Q

If RNN is more powerful than HMM why has HMM been used for many years?

A

It is difficult to train RNN: time consuming bout also because of a phenomenon called vanishing/exploding gradient

51
Q

RNNS. What happens to the magnitude of the gradients as we backpropogate through many layers?

A
  • If the weights are small, gradients shrink exponentially
  • If the weights are big the gradients grow exponentially
  • in addition, activation function can also cause vanishing gradients

(if the network only has a few hidden layers, this won’t make too much trouble)

52
Q

RNN is often trained on long sequences and the gradients can easily explode or vanish, and RNN —- ——, which can make the problem be serious.

What do RNNs have difficulty daling with as a result?

A

share parameters

RNNs have difficulty dealing with long-range dependencies

53
Q

What is (the most famous) solution to vanishing/exploding gradients?

A

The design of specific types of RNN architecture -> Long Short-Term Memory (LSTM) - uses gates to control information flow in RNN

(Another less widely used is gated recurrent unit)

54
Q

What is the purpose of the input gate and the forget gate in the Long Short-Term Memory model?

A

The input gate decides how much information is allowed to get in memory cell from the input
The Forget gate (should actually be called the memory gate) decides how much information is allowed from the previous memory

55
Q

What is the purpose of the sigmoid function in the Long Short-Term Memory model?

A

Ensures the gate values are between 0 and 1

56
Q

Convolutional Nerual Networks (LeNet)…
Yann LeCun and others developed a really good recognizer for handwritten digits by using backpropogation in a feedforward net with:???

A
  • many hidden layers
  • use filters of replicated units to share parameters
  • use subsampling operation
57
Q

What is the convolution state of CNN?

A

The first key component of CNN.
A filter shifts on input to detect local features. At the convolutional layer of a CNN, we often have multiple numbers of such filters (each has its own parameters)/

58
Q

What is the difference between convolution and fully connected?

A

Convolution has fewer parameters! al
(i.e. in the example from the notes/ question the first output is only connected to 9 inputs not fully connected to all 36)
also has shared weights

59
Q

Does convolution share the same parameters across al spatial locations? What does traditional matrix multiplication do?

A

Yes, traditional mm does not share any parameters

60
Q

What is subsampling used for? (what are the 3 reasons it is used)

A

Can subsample pixels to make image smaller, so (1) fewer parameters are used to characterize the image, (2) computation is less expensive, and (3) the view of each filter at an upper layer can see a wide range.

61
Q

Max pooling introduces some — for image translation.

A

invariance

62
Q

What is another widely-used pooling, instead of taking max that is used?

A

Average pooling - to take average of an area

63
Q

What is used in CNN after convolutions and subsampling?

A

fully connected layers

64
Q

How does CNN control the number of model parameters?

A
  • reducing number of connections
  • shared weights on the edges
  • pooling further reduces the complexity
65
Q

What is ImageNet?

A

~14 million labeled images, 20k classes

used for CNN learning

66
Q

Training CNN - how is training done for Max pooling subsampling

why?

A

for a pooling area with N-by_N input units, you just need to backprop gradient to the unit that has the max value (the winning unit)

why? if you make a small change in other non-max units, it will not change the pooling results, indicating gradient is zero w.r.t. that non-winning unit

67
Q

Training CNN - how is training done for Average pooling subsampling

A

take derivative w.r.t. the corresponding units (i.e. the error is multiplied by 1/N^2 and assigned to the whole pooling area (all units get this same gradient value))

68
Q

We can improve the performance of a model by putting prior knowledge about the task into the network by designing appropriate:

What is another way we can use prior knowledge?

A
  • connectivity
  • weight constraints
  • neuron activation functions

Less intrusive then hand-designing the features \

Alternatively can use prior knowledge to create a whole lot more training data

69
Q

Consider a neural net with one hidden layer. Each time we present a training example, we randomly omit each hidden unit with probability 0.5. What is this called?

A

dropout

works very well for different neural network architectures
Can also be beneficial for input layer but with a higher probability of keeping input unit

70
Q

Training different good models and AVERAGING/assembling the models can often lead to —– ———

A

better performance

71
Q

Why dropout?

A

(dropout as a form of model averaging) Dropout can be seen as sampling from different “thinned” architectures (each hidden layer now use different nomber of hidden units) —> at test time could sample many different architectures and take the average of their output distributions

  • best to use all hidden units, but to halve their outgoing weights
  • we sample from 2^H models
  • the sharing of the weights means that every model is very strongly regularized
  • much better than Lasso or squared penalities that pull the weights towards zero
72
Q

If you deep neural net is significantly overfitting, dropout will ……????

A

usually reduce the number of errors by a lot

73
Q

True or False

Any net that uses “early stopping” does worse by using dropout

A

Falso - can do better

at the cost of taking quite a lot longer to train

74
Q

True or False

If your deep neural net is not overfitting you should be using a bigger one

A

True

75
Q

What is an autoencoder?

A

a neural network trained to copy its input to its output
Traditionally used for dimentionality reduction as well as feature learning

hoping that training it will result in taking on useful properties

76
Q

How can we use backprop to implement PCA?

A
  • design a feed forward net trying to make the output be the same as the input with a central code layer
  • hidden and output layers linear, will learn hidden units that are a linear function of the data and minimize the squared reconstruction error
  • M hidden units will span the space as the first M components found by PCA, although they are not exactly the same vectors
77
Q

The input-to-hidden part of an autoencoder corresponds to a(n) ———
The hidden-to-output part corresponds to a(n) ————-

A

encoder

decoder

78
Q

What are the two types of autoencoders?

A

Undercomplete: the code layer h has smaller dimension than input vector x (more widely used)
overcomplete: the code layer h has larger dimension than input vector x

79
Q

If the encoder and decoder are allowed too much capacity what is the risk?

A

the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data
- useful autoencoders should not copy perfectly but ristricted by design to copy only approximately (in doing so learns useful properties of the data to reconstruct the data)

80
Q

What are some ways to design autoencoders to copy only approximately, so that the system learns useful properties of the data

A
  • sparsity of representation
  • smallness of the derivative of the representation
  • robustness of noise of missing inputs
81
Q

What is a denoising autoencoder?

A
  • one that recieves a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output
82
Q

How does a denoising autoencoder learn the reconstructed distribution?

A
  • choose a training sample from the training data
  • obtain corrupted version from curruption process
  • use training sample pair to estimate reconstruction