Transformers Flashcards

Question

Alright, so we've got the two pieces needed to finish off the attention mechanism. You've got V, where each column says "if the word for this column is relevant to something, how should that something be updated?" And you've got the (context x context) matrix that came out of K and Q, which shows pairwise, "does the word in row i attend to the word in column j? Should row i's word cause an update to the word at index j?" How do you take these two components and produce the final attention output?

Answer 1

To start, you multiply them together as pictured. By turning a column of the attention matrix on its side and summing all the columns in V according to those values, what you're doing is taking the softmax weights over what is important to query q, and using those coefficients to take a weighted sum of all the values in V. Each value is a statement on how to update the embedding for the word associated with q if the word associated with v attends to the q word. So exactly what we want! That final matrix multiplication gives the *deltas* we want to apply to the embeddings, so we add that result to the original embeddings to get our output.
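A minimal NumPy sketch of that last step, under a few assumptions that are not on the card: tokens are stored as rows (the transpose of the column-per-token picture), the scores are scaled by sqrt(d_head) as in standard scaled dot-product attention, there is no causal mask, and the names `attention_delta`, `w_q`, `w_k`, `w_v` are made up for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtracting the max is for numerical stability; it does not change the result.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_delta(x, w_q, w_k, w_v):
    """x has shape (context, embed_dim); returns (delta, weights)."""
    q = x @ w_q                                # (context, d_head)
    k = x @ w_k                                # (context, d_head)
    v = x @ w_v                                # (context, embed_dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (context, context)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    delta = weights @ v                        # weighted sum of value vectors
    return delta, weights

rng = np.random.default_rng(0)
context, embed_dim, d_head = 4, 8, 3
x = rng.normal(size=(context, embed_dim))
w_q = rng.normal(size=(embed_dim, d_head))
w_k = rng.normal(size=(embed_dim, d_head))
w_v = rng.normal(size=(embed_dim, embed_dim))

delta, weights = attention_delta(x, w_q, w_k, w_v)
output = x + delta   # the delta is added to the original embeddings
```

With rows as tokens, row i of `weights` holds the coefficients used to mix the value vectors into token i's delta.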

Answer 2

embeddings = embeddings + :

Answer 3

Rather than a giant (embed_dim x embed_dim) matrix, it is stored as a low-rank factorization of an (embed_dim x embed_dim) matrix, in the form of two matrices of size (embed_dim x smaller_size) and (smaller_size x embed_dim). In the case of GPT-3, smaller_size happens to be the same as the dimension of the q's and k's. 3b1b calls these two matrices the value_down and value_up matrices.
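To get a feel for the savings, here is a quick parameter count. The sizes are assumptions taken from the commonly quoted GPT-3 configuration (embed_dim = 12288, smaller_size = 128), not from the card itself.

```python
embed_dim = 12288      # assumed GPT-3 embedding dimension
smaller_size = 128     # assumed to match the q/k dimension

# One full (embed_dim x embed_dim) value matrix.
full_params = embed_dim * embed_dim

# value_down (embed_dim x smaller_size) plus value_up (smaller_size x embed_dim).
low_rank_params = embed_dim * smaller_size + smaller_size * embed_dim

savings_factor = full_params // low_rank_params
print(full_params, low_rank_params, savings_factor)
```

Under these assumed sizes the factored form needs 48 times fewer parameters per head than a full matrix would.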

Answer 4

There are two different strings of tokens, and you do attention across them, rather than attending from one string to...that same string. So for language translation, you see which French words attend to which English words, for example.

Answer 5

In multi-headed attention, you simply do attention several distinct times within an attention block, *kinda* like a convolutional layer with multiple output channels. A single attention head, like we've discussed here, learns one way of understanding attention. Using M_Q, M_K, and the decomposed M_V, it learns one way to update the embedding of a token based on the other tokens in its context. But there could be several ways of doing this. Maybe one way is to have the query say "what adjectives are modifying what nouns?" and another say "are there other proper nouns that add color to the context of this proper noun?" etc. Each head of attention within the block gets its *own* M_Q, M_K and decomposed M_V. You're just doing attention several times over.

Answer 6

You just take the delta in the embedding output by each head of attention, and add *all* of them to the embedding, all at once. So for example, 96 different instances of M_Q, M_K and (decomposed) M_V yield 96 different deltas, which are all summed at the end of the multi-headed attention block and added to the initial embeddings that came into that attention block.
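A small sketch of that summing step, assuming tokens are rows, scaled dot-product scores, no causal mask, and 3 tiny heads instead of 96. The helper names (`head_delta`, `multi_head_block`, `random_head`) and the dict layout are hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def head_delta(x, head):
    # Each head has its own query, key, and decomposed value matrices.
    q = x @ head["w_q"]
    k = x @ head["w_k"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ (x @ head["w_v_down"]) @ head["w_v_up"]

def multi_head_block(x, heads):
    # Sum every head's delta, then add the total to the incoming embeddings.
    total_delta = np.zeros_like(x)
    for head in heads:
        total_delta += head_delta(x, head)
    return x + total_delta

def random_head(rng, embed_dim, d_head):
    return {
        "w_q": rng.normal(size=(embed_dim, d_head)),
        "w_k": rng.normal(size=(embed_dim, d_head)),
        "w_v_down": rng.normal(size=(embed_dim, d_head)),
        "w_v_up": rng.normal(size=(d_head, embed_dim)),
    }

rng = np.random.default_rng(0)
context, embed_dim, d_head, n_heads = 4, 8, 2, 3
x = rng.normal(size=(context, embed_dim))
heads = [random_head(rng, embed_dim, d_head) for _ in range(n_heads)]
output = multi_head_block(x, heads)
```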

Answer 7

Across multi-headed attention, all the value up matrices are actually stapled together, and referred to as a single "output" matrix. And those value down matrices will actually be referred to as just "value" matrices. Possible source of confusion and parlance difference to be aware of. Furthermore: so we're thinking of v = M_V_up * M_V_down * x. And we multiply the v's by the attention softmax weights we got from Q and K. In practice, you multiply just (M_V_down * x) by those attention weights, then up-project *that* with M_V_up. This is mathematically equivalent to doing it the straightforward way, but it saves on some computation because (M_V_down * x) is smaller than (M_V_up * M_V_down * x).
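Both claims can be checked numerically. The sketch below uses random stand-in attention weights and made-up small sizes, with tokens as rows (so the products appear in the transposed order compared to v = M_V_up * M_V_down * x). It checks that up-projecting before or after applying the attention weights gives the same result, and that summing per-head deltas equals concatenating the per-head outputs and multiplying by one stacked "output" matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
context, embed_dim, d_head, n_heads = 5, 16, 4, 3
x = rng.normal(size=(context, embed_dim))

# Stand-in attention weights: each row is non-negative and sums to 1.
raw = rng.random(size=(n_heads, context, context))
weights = raw / raw.sum(axis=-1, keepdims=True)

w_down = rng.normal(size=(n_heads, embed_dim, d_head))   # "value" matrices
w_up = rng.normal(size=(n_heads, d_head, embed_dim))     # pieces of the "output" matrix

# (1) Up-project first, then weight, versus weight first, then up-project.
up_first = weights[0] @ (x @ w_down[0] @ w_up[0])
up_last = (weights[0] @ (x @ w_down[0])) @ w_up[0]

# (2) Sum of per-head deltas versus one stacked output matrix.
per_head = [weights[h] @ (x @ w_down[h]) for h in range(n_heads)]
summed = sum(per_head[h] @ w_up[h] for h in range(n_heads))
concat = np.concatenate(per_head, axis=-1)          # (context, n_heads * d_head)
w_output = np.concatenate(list(w_up), axis=0)       # (n_heads * d_head, embed_dim)
stapled = concat @ w_output
```

The intermediate `per_head` arrays have only d_head columns each, which is where the compute savings come from.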

Answer 8

A slightly modified, smoother ReLU that models sometimes use

Answer 9

A hyperparameter you can set that kinda determines how "creative" the model will be. More specifically, it determines how likely or unlikely the model is to predict values that aren't at the top of the probability distribution. Higher temperature means a higher likelihood of these lower-down values.

Answer 10

It's done within the final softmax function. Where you would normally compute e^logit for each logit and normalize by their sum, you instead compute e^(logit/T) for the temperature value T. So for example, if T=1, it's just a normal softmax, but if T is high, it brings the extreme positive values closer to the more middling positive values. T=0 means always picking the most likely outcome, but that would be dividing by zero, so we create this behavior with a simple if-else statement: "if T=0 just pick the most likely word, else do the probability computation as normal."
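A standard-library sketch of that logic. The function names and the example logits are made up for illustration, and the max-subtraction is a numerical-stability detail that is not on the card.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    # Divide every logit by T before exponentiating.
    scaled = [logit / temperature for logit in logits]
    biggest = max(scaled)  # subtracted for numerical stability only
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    # T=0 would divide by zero, so handle it as "pick the most likely token".
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
normal = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 10.0)   # closer to uniform
greedy = sample_token(logits, 0)                # always index 0 here
```

Comparing `normal` and `flat` shows the top token losing probability mass to the others as T grows.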

Answer 11

It flows through the MLP block. ***Wait, I think that's technically incorrect: it does a layer norm first, and then another layer norm after the MLP. See https://dugas.ch/artificial_curiosity/img/GPT_architecture/fullarch.png and https://dugas.ch/artificial_curiosity/GPT_architecture.html. I should update this card once I dig more into GPT-3's exact architecture.

Answer 12

Linear layer -> ReLU -> Linear -> add that to the input to the first linear layer. The first linear layer significantly increases the dimension, and the second brings it back down to where it was (so, similar to how the attention block computes a delta, this block is also computing a delta that is added to its input).
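A minimal NumPy sketch of this block, assuming tokens are rows and a hidden size of 4x the embedding size (a common choice, but an assumption here). The names `mlp_block`, `w_up`, `b_up`, `w_down`, `b_down` are hypothetical, and layer norm is left out.

```python
import numpy as np

def mlp_block(x, w_up, b_up, w_down, b_down):
    hidden = np.maximum(0.0, x @ w_up + b_up)   # linear up-projection, then ReLU
    delta = hidden @ w_down + b_down            # linear down-projection
    return x + delta                            # add the delta to the block's input

rng = np.random.default_rng(0)
context, embed_dim = 4, 8
hidden_dim = 4 * embed_dim   # assumed expansion factor

x = rng.normal(size=(context, embed_dim))
w_up = rng.normal(size=(embed_dim, hidden_dim))
b_up = np.zeros(hidden_dim)
w_down = rng.normal(size=(hidden_dim, embed_dim))
b_down = np.zeros(embed_dim)

output = mlp_block(x, w_up, b_up, w_down, b_down)
```

Each token (row) goes through the same weights independently; no information moves between tokens in this block.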

Answer 13

Not typically, as per http://ai.stackexchange.com/questions/40252/why-are-biases-typically-not-used-in-attention-mechanism and from watching 3b1b's video. The MLP block after the attention block does use biases, naturally, as it's an MLP.

Answer 14

A layer norm operation after the MLPs. (Layer norm is described in more detail in the deep learning deck.) **Again, I think this diagram from Grant may be freaking wrong! Or at least there's a disagreement between my sources: 3b1b and https://dugas.ch/artificial_curiosity/GPT_architecture.html. I will need to resolve this disagreement and update this slide accordingly.

Answer 15

BERT is a language model that takes a phrase as input and returns a context-dependent embedding for each of its words, as well as context-dependent embeddings for the start and end tokens, which are placed at the beginning and end of the phrase. ("Context-dependent word embeddings" is how I think about it.) "Base" embeddings are size 768; "large" embeddings are size 1024.

Answer 16

SBERT is essentially a fine-tuned version of BERT that pools word embeddings to create good sentence embeddings. For a given input phrase, BERT's output is an embedding for each word, plus the start and end tokens. These can be "pooled" to make a sentence embedding: one common option is simply to output the embedding of the start token as your sentence embedding, and another is to take the average of all the embeddings. SBERT automatically does one of these based on the version (so its output on a given phrase is a single sentence embedding vector), and it has been fine-tuned to be good at this specifically.
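The two pooling options can be sketched in plain Python. The toy vectors stand in for real BERT outputs (which would be 768 or 1024 wide), the function names are made up, and the sketch assumes there are no padding tokens to exclude from the average.

```python
def cls_pooling(token_embeddings):
    # Use the start token's embedding (first row) as the sentence embedding.
    return list(token_embeddings[0])

def mean_pooling(token_embeddings):
    # Average each dimension across all token embeddings.
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]

# Toy per-token outputs: start token, one word, end token.
token_embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

from_start_token = cls_pooling(token_embeddings)
from_average = mean_pooling(token_embeddings)
```

Either way, a variable-length list of token vectors becomes one fixed-size vector.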

Answer 17

You have SBERT embed two sentences, then use something simple like cosine similarity to calculate the sentences' similarity based on the sentence embeddings, and then you compare this to a label you have between 0 and 1 showing how similar they are.
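A sketch of that comparison with the standard library only. The embedding vectors below are placeholders rather than real SBERT outputs, and squared error is one plausible way to compare the similarity against the label; the card does not say which loss is used.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_loss(embedding_a, embedding_b, label):
    # Squared difference between predicted similarity and the 0-to-1 label (assumed loss).
    return (cosine_similarity(embedding_a, embedding_b) - label) ** 2

# Placeholder sentence embeddings.
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
loss = similarity_loss([1.0, 0.0], [0.0, 1.0], label=0.0)
```

During fine-tuning, the gradient of a loss like this would flow back through both sentence embeddings into the shared encoder.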

Answer 18

Suppose we're using it for sentence/phrase/paragraph embeddings specifically, taking the embedding of the start token. We can use this for search engines: get an embedding of the text from every page on the internet, get an embedding for the phrase the user typed into the search engine, and return the pages whose embeddings are most similar to the query's embedding.

Answer 19

Input your documents to BERT and get a sentence embedding for each, then train a few additional layers to predict your outcome variable based on those embeddings. If you have lots of data you could also fine-tune BERT itself.

(43 cards)