What does GPT stand for?
Generative Pre-trained Transformer
How do LLMs like ChatGPT produce text?
They take in a passage and predict the next word in the passage. They output a probability distribution for the next word, sample from that distribution, append the sampled word to the passage, and repeat. Over and over and over.
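The loop above can be sketched roughly like this. This is only a toy under stated assumptions: TOY_MODEL, next_word_distribution, and generate are hypothetical names, and the lookup table stands in for a real LLM, which would condition on the whole passage rather than just the last word.

```python
import random

# Hypothetical toy "model": maps the last word to a next-word distribution.
# A real LLM would look at the whole passage, not just the last word.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def next_word_distribution(words):
    """Stand-in for the LLM: return P(next word | passage so far)."""
    return TOY_MODEL[words[-1]]

def generate(prompt, max_new_words=10, seed=0):
    rng = random.Random(seed)
    words = list(prompt)
    for _ in range(max_new_words):
        dist = next_word_distribution(words)
        candidates = list(dist)
        weights = [dist[w] for w in candidates]
        # Sample one word from the predicted distribution.
        word = rng.choices(candidates, weights=weights, k=1)[0]
        if word == "<end>":
            break
        # Append it and go around again.
        words.append(word)
    return words

print(generate(["the"]))
```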
How would you turn a model like that into a chat bot?
Basically with a system prompt. So you pass something like the following into the model and ask it to complete it:
“What follows is an interaction between a user and a helpful AI assistant:
User: <user></user>
AI Assistant: _______”
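As a rough illustration, the framing passage could be assembled like this. build_chat_prompt is a hypothetical helper, and dropping the user message straight into the User line is my assumption about how the template gets filled in.

```python
def build_chat_prompt(user_message):
    """Wrap a user message in a framing passage for a next-word predictor to complete."""
    return (
        "What follows is an interaction between a user and a helpful AI assistant:\n"
        f"User: {user_message}\n"
        "AI Assistant:"
    )

prompt = build_chat_prompt("What does GPT stand for?")
print(prompt)
```

The model is then asked to continue the text after "AI Assistant:", and whatever it writes is shown to the user as the reply.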
What is a token?
The unit of content that ChatGPT predicts one at a time. Words, pieces of words, or punctuation marks.
In other domains, it could be little chunks of an image, or little patches of a sound for audio processing.
What is the core concept underlying transformers?
Self attention. Better understanding a particular part P of the input by learning to pay attention to other parts of the input to inform your understanding of P.
What is a transformer layer essentially?
A transformer layer is basically just a layer which applies self attention to an input sequence, plus some additional frills for performance (though they are meaningful: for example, the MLP after each attention mechanism seems to be where ChatGPT stores facts).
So it receives some sort of embedding of every input in a sequence, and uses self attention to output new, better embeddings for that sequence.
What is taking the dot product of two vectors?
Multiplying the corresponding entries of the two vectors pairwise, then summing those products up to produce a scalar.
It’s positive if they point in similar directions, 0 if they’re orthogonal, negative if they point in “opposite-ish” directions
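A minimal sketch of the three cases, using made-up 2-dimensional vectors (the numbers are just for illustration):

```python
def dot(a, b):
    """Multiply matching entries pairwise, then sum the products."""
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

similar = dot([1.0, 2.0], [2.0, 3.0])      # 1*2 + 2*3
orthogonal = dot([1.0, 0.0], [0.0, 5.0])   # 1*0 + 0*5
opposite = dot([1.0, 2.0], [-1.0, -1.5])   # 1*(-1) + 2*(-1.5)

print(similar, orthogonal, opposite)
```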
What is the first step in GPT-3 processing an input?
Pass all the “words”, i.e. tokens, through an initial learned embedding matrix
What are the inner workings of a chatbot model like GPT-3 working with? What thing are they continually refining?
They’re refining those initial, “baseline” embeddings for each word to be richer, more “context-dependent word embeddings” as I call them.
For example, coloring “king” by the fact that it seems like a royal king, described in Shakespearean language, rather than a king on a chess board.
In these modern LLMs, these embeddings eventually get really, really really rich with meaning, such that at the very end you can use just the final embedding to accurately predict the next word.
What is the unembedding matrix at the end of GPT?
An embedding matrix takes the one-hot encoded vector of the word in the model’s vocabulary, and maps it to an embedding.
An unembedding matrix takes the last contextual word embedding in the sequence, which is what GPT uses to predict the next word, and maps it to a vector with length equal to the vocabulary size, where the scalars are logits for which word is best to predict next. That’s then passed through a softmax to get predictions.
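Here is a small sketch of both directions, assuming a hypothetical 3-word vocabulary and 2-dimensional embeddings. All the numbers in EMBED_MATRIX and UNEMBED_MATRIX are invented for illustration; in a real model they are learned.

```python
import math

VOCAB = ["the", "king", "queen"]

# Hypothetical embedding matrix: 2 rows (dimensions), one column per vocab word.
EMBED_MATRIX = [
    [0.1, 1.0, 0.9],
    [0.0, 0.5, 0.7],
]

# Hypothetical unembedding matrix: one row per vocab word, 2 columns (dimensions).
UNEMBED_MATRIX = [
    [0.2, -0.1],
    [1.5, 0.3],
    [1.0, 1.2],
]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def embed(word):
    """One-hot vector times the embedding matrix: this just picks out one column."""
    one_hot = [1.0 if w == word else 0.0 for w in VOCAB]
    return matvec(EMBED_MATRIX, one_hot)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(final_embedding):
    """Map the last contextual embedding to one probability per vocab word."""
    logits = matvec(UNEMBED_MATRIX, final_embedding)
    return softmax(logits)

king_embedding = embed("king")
probs = unembed([0.8, 0.9])
print(king_embedding, probs)
```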
What’s the main reason why GPT only uses the final word embedding to predict the next word, rather than all of the final embeddings for the whole input sequence?
It makes training more efficient. During training, every final embedding in the sequence is used to predict the word that comes right after it, so each forward pass gives you thousands of predictions you can backpropagate on.
(Note that this says quite a lot about just how rich these embeddings get, like 3b1b explains. I like the “contextual embedding” outlook, but by the end in reality they get even richer than that: even if the last word is just “the” for example, you can successfully predict the next word from only that embedding!)
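To make the "many predictions per forward pass" idea concrete, here is a sketch of the training examples one short sequence provides. The sentence and the variable names are made up for illustration.

```python
tokens = ["the", "cat", "sat", "down"]

# Position i's final embedding is used to predict the token at position i + 1.
training_pairs = [
    (tokens[: i + 1], tokens[i + 1])
    for i in range(len(tokens) - 1)
]

for context, target in training_pairs:
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 prediction targets from a single pass, which is where the efficiency comes from.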
What specifically is the goal of one attention block applied to word embeddings?
To compute the delta that needs to be added to those embeddings in order to make them contextually richer. That is, it computes the ΔE’s that you add to the E’s to get the updated embeddings E′.
What are Q, K, and V?
These matrices are just re-representations of the incoming embedding matrix, obtained by multiplying the incoming matrix by 3 weight matrices: W_Q, W_K, and W_V. They have the same number of tokens/columns, but can have a different dimension/number of rows.
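A sketch of the three projections with tiny made-up numbers. I store the sequence as a list of per-token vectors, so each inner list plays the role of one column. The sizes are assumptions for illustration: embedding dimension 3, key/query dimension 2, value dimension 3.

```python
def matvec(W, x):
    """Multiply weight matrix W (a list of rows) by column vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Two tokens, each with a 3-dimensional embedding (one "column" per token).
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]

# Hypothetical weight matrices; in a real model these are learned.
W_Q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                    # 2 x 3
W_K = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]                    # 2 x 3
W_V = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]   # 3 x 3

Q = [matvec(W_Q, x) for x in X]  # one query per token
K = [matvec(W_K, x) for x in X]  # one key per token
V = [matvec(W_V, x) for x in X]  # one value per token

print(Q, K, V)
```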
What is matrix Q conceptually? Or easier, what is its entry q for one word in the input?
The qs, or queries, can be thought of as “the input word asking a question, that it can use to better contextually embed itself.” Like “are there any adjectives modifying me?”
What is K conceptually?
The ks, or keys, are the potential answers to the questions being asked by the queries. For example, maybe the key encodes “yes, I’m an adjective!” if the word is an adjective, providing an answer to the question.
How do you determine how well each key matches each query?
Compute a dot product of every possible query-key pair, yielding a scalar for each pair. So you basically see how “similar” the query and key are.
So basically, a big matrix multiplication of the Q and K matrices.
What does it mean for a k to “attend to” a q?
It means that the network realizes it should have attention on the word corresponding to k when interpreting the word corresponding to q
What do we do to K^T * Q, the dot products of all the query and key vectors?
We pass them through a softmax (within the transformer block, not just at the end of the network). So for each query, the dot products that are much smaller than the largest ones go to nearly nothing, and the large positive ones are the ones that matter. These are the ones that show words that “attend to” other words.
The softmax is over all keys, for each query. So as per 3b1b it’s a column-wise softmax. For each query, you’re making a vector showing how much each key attends to that query.
There’s also a simple scaling term used here for stability. You divide by the sqrt of the dimension of the q and k vectors.
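The scoring, scaling, and softmax steps can be sketched together like this. It assumes made-up 2-dimensional queries and keys, and stores each query's softmax output as its own list, which corresponds to one column in the column-wise picture.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """weights[i][j] says how much key j attends to query i."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Dot product with every key, scaled by sqrt of the key/query dimension.
        scores = [dot(k, q) / math.sqrt(d_k) for k in keys]
        # Softmax over all keys, for this one query.
        weights.append(softmax(scores))
    return weights

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[0.0, 2.0], [1.0, 1.0]]
weights = attention_weights(queries, keys)
print(weights)
```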
In 3b1b’s and Jay Alammar’s explanations, does the data flowing through the network have words/tokens along rows, or columns?
Columns. Hits my brain weird but it’s the way they do it
In the original Attention is All You Need paper, they do rows. But in these flashcards I generally do columns, because I used 3b1b’s videos (and also cuz it seems kinda nice I think)
This isn’t a real flashcard, but for reference, I want to link a note I have about an error in 3b1b’s videos that tripped me up in understanding this stuff: https://docs.google.com/document/d/1ahiSnxsoEKXEe1Pq-gYK3AsVFMgykHZoE7czyy0qO4s/edit?tab=t.0
What is masked self attention?
You stop later words from influencing earlier words (in the terminology above, later words are not allowed to attend to earlier words). In the case of GPT-3, this is useful so every word in a batch can be a training example for next-word prediction, without having the model “cheat by looking later in the sequence”.
How is masked self attention accomplished computationally?
Before you apply the column-wise softmax to (K^T * Q), you set all the entries in the matrix corresponding to a key at index beyond the query’s index to be negative infinity. So when you pass it through softmax, the attention weight on those later keys becomes zero.
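A sketch of that masking step, under the same assumptions as a plain attention-weight computation: made-up vectors, one softmax per query, and hypothetical function names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    # math.exp(-inf) is 0.0, so masked entries get exactly zero weight.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(queries, keys):
    d_k = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                # Key comes later in the sequence than the query: mask it out.
                scores.append(float("-inf"))
            else:
                scores.append(dot(k, q) / math.sqrt(d_k))
        weights.append(softmax(scores))
    return weights

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked = masked_attention_weights(vectors, vectors)
print(masked)
```

The first query can only see its own key, so all of its weight lands there. The last query sees every key.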
How does the complexity of a transformer block scale with context size? Where in the transformer operation does this come from?
Quadratically. It comes from K^T * Q, which multiplies a (context size × key/query shared dim) matrix by a (key/query shared dim × context size) matrix, producing a (context size × context size) matrix of scores.
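A tiny sketch of the counting argument: one dot product per (query, key) pair. attention_score_count is a hypothetical helper, and the context sizes are just example values.

```python
def attention_score_count(context_size):
    """Number of query-key dot products in one attention head."""
    return context_size * context_size

for n in (1024, 2048, 4096):
    print(n, attention_score_count(n))
```

Doubling the context size quadruples the number of scores to compute.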
What is V conceptually?
Using Q and K, we’ve determined which words are relevant to which other words. Now we need to use that to update embeddings.
So now, in a basic conceptual sense, we need to know, “if word A is relevant to this other word B, how do we update the embedding for word B based on word A?”
By multiplying the input words in X by W_V to get V, we basically answer, “if word x is relevant to some other word, how should we update that other word?” V encodes the answer to that question for each input x. So it’s not “how should x be updated?”, it’s “if x is relevant to something else, how should that something else be updated?”
So each value v can be thought of as being associated with a key k, not a query q. Because they’re associated with the word that does-the-informing-wrt-the-word-being-updated.
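Putting Q, K, and V together, one unmasked, single-head attention update might look like the sketch below. The weight matrices are invented, the value dimension is assumed equal to the embedding dimension so the delta can be added directly, and attention_update is a hypothetical name.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_update(X, W_Q, W_K, W_V):
    Q = [matvec(W_Q, x) for x in X]
    K = [matvec(W_K, x) for x in X]
    V = [matvec(W_V, x) for x in X]  # value j travels with key j
    d_k = len(K[0])
    updated = []
    for x, q in zip(X, Q):
        # How much each key attends to this query.
        weights = softmax([dot(k, q) / math.sqrt(d_k) for k in K])
        # Delta for this word: weighted sum of the values of the informing words.
        delta = [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(x))]
        # New embedding = old embedding + delta.
        updated.append([xi + di for xi, di in zip(x, delta)])
    return updated

X = [[1.0, 0.0], [0.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[0.5, 0.0], [0.0, 0.5]]
new_X = attention_update(X, W_Q, W_K, W_V)
print(new_X)
```

Note that the weights are indexed by key, and each weight multiplies the value belonging to that same key, which is the sense in which v goes with k rather than q.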