How Transformer Architecture Powers LLMs

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5168

    #1

    How Transformer Architecture Powers LLMs

    We use LLMs every day, but most explanations stop at

    “it’s a transformer” and move on.


    What actually happens between a prompt and the next generated word?

    How does the model decide what matters and what doesn’t?


    This article breaks down that flow — step by step — without math,

    and without hand-waving.



    🧠 How Transformers Differ from Traditional Models

    Older language models processed text sequentially, focusing mostly on neighboring words.


    That meant:
    • Limited long-range understanding
    • Difficulty connecting distant words in a sentence


    Transformers changed this by doing something radical:


    They consider the relationship between every word and every other word — all at once.


    Instead of asking only:

    “What word comes next based on the previous one?”


    They ask:

    “How does every word relate to every other word in this sentence?”


    This is what allows LLMs to understand context at scale.



    🧩 Breakdown of the Transformer's core components

    Below are the key components that transform raw text into predictions.


    1. Tokenization - Turning Text Into Numbers


    Before anything else, the prompt is converted into tokens.






    Example:
    Prompt: "Write a story about dragon"
    Tokens: [9566, 261, 4869, 1078, 103944]







    Why does this step exist?


    Models don’t understand raw text.

    They operate on numbers.


    At this stage:
    • Tokens are just identifiers
    • They carry no meaning or context
    • “dragon” is just a number, not a concept


    That limitation is solved in the next step.
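A minimal sketch of this step, assuming a hypothetical word-level vocabulary (`TOY_VOCAB`) with made-up IDs. Real LLM tokenizers split text into subword pieces and use far larger vocabularies, so these IDs will not match the ones in the example above.

```python
# Hypothetical toy vocabulary: the words and IDs are made up for illustration.
TOY_VOCAB = {"<unk>": 0, "write": 1, "a": 2, "story": 3, "about": 4, "dragon": 5}

def tokenize(text: str) -> list[int]:
    """Map each lowercase, whitespace-separated word to its ID (0 if unknown)."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]

token_ids = tokenize("Write a story about dragon")
print(token_ids)  # [1, 2, 3, 4, 5]
```

Note that the IDs are arbitrary labels: nothing about the number 5 says "dragon" until an embedding is attached to it.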


    2. Vector Embeddings - Adding Meaning Beyond Words


    Vector embeddings capture semantic meaning — words with similar meanings end up closer together in vector space.


    Consider these two sentences:
    • “He deposited money in the bank”
    • “They sat near the river bank”


    Tokenization treats bank the same in both cases.


    Why are embeddings needed?


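    Vector embeddings represent words as points in a multi-dimensional space. The initial embedding of “bank” is the same in both sentences; it becomes context-dependent once the attention layers (step 4) mix in information from the surrounding words. After that, the two uses of “bank” have different vectors.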






    Example:
    bank (finance) → [0.82, -0.14, 0.56, 0.09]
    bank (river) → [-0.21, 0.77, -0.63, 0.48]







    The numbers themselves don’t matter.

    What matters is distance and direction between vectors.


    This is how the model distinguishes meaning.
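One common way to compare direction between vectors is cosine similarity. The sketch below applies it to the two illustrative “bank” vectors from the example above; the function name `cosine_similarity` is my own, and the vectors are not taken from any real model.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

bank_finance = [0.82, -0.14, 0.56, 0.09]  # illustrative vector from the example
bank_river = [-0.21, 0.77, -0.63, 0.48]   # illustrative vector from the example

similarity = cosine_similarity(bank_finance, bank_river)
print(round(similarity, 2))
```

The result is negative, meaning the two vectors point in roughly opposite directions, which is how the two senses of “bank” would be told apart.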


    3. Positional Encoding - Preserving Word Order


    Embeddings capture meaning — but not order.

    Without positional information, these two sentences look identical to the model:
    • “The dog chased the cat”
    • “The cat chased the dog”


    Positional encoding injects order information into each word embedding.


    So now we have:






    Embedding + Position
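A sketch of “Embedding + Position” using the sinusoidal scheme from the original Transformer design. This is one option among several (many newer models use learned or rotary position schemes instead), and the embedding values below are hypothetical.

```python
import math

def positional_encoding(position: int, dim: int) -> list[float]:
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(dim):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embedding: list[float], position: int) -> list[float]:
    """Element-wise sum: Embedding + Position."""
    pos = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pos)]

the_embedding = [0.10, 0.20, 0.30, 0.40]     # hypothetical embedding for "the"
first_the = add_position(the_embedding, 0)   # "The" at position 0
second_the = add_position(the_embedding, 3)  # "the" at position 3
print(first_the != second_the)  # True: same word, different positions
```

Because the same word now gets a different vector at each position, “The dog chased the cat” and “The cat chased the dog” no longer look identical.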







    4. Self-Attention (The Core Idea)

    Once embeddings + positional data are ready, they pass through the self-attention layer.


    Self-attention assigns a weight to every word relative to every other word.


    This allows the model to:
    • Focus on relevant relationships
    • Ignore irrelevant ones


    Why does self-attention exist?


    Not all words matter equally.


    In the sentence:


    “The fisherman caught the fish with a net”


    The model needs to figure out:
    • Does “with a net” describe fisherman or fish?
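The weighting idea can be sketched as scaled dot-product attention. This is a simplified illustration: real models first pass each vector through learned query, key, and value projections, whereas here the input vectors are used directly for all three, and the 2-dimensional vectors are hypothetical.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Scaled dot-product attention with Q = K = V = vectors (no learned weights)."""
    scale = math.sqrt(len(vectors[0]))
    weights, outputs = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all the input vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, vectors))
                        for d in range(len(query))])
    return weights, outputs

# Hypothetical 2-dimensional vectors for three words.
words = ["fisherman", "fish", "net"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(vectors)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the list. With these made-up vectors, “fisherman” gives more weight to “fish” than to “net” simply because their vectors are more alike.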





    5. Multi-Head Self-Attention - Looking at Multiple Meanings at Once


    A single attention pattern isn’t enough.

    Different relationships exist at the same time:
    • grammatical
    • semantic
    • long-range dependencies


    Multi-head attention solves this by running multiple attention heads in parallel within the same layer.


    Each head learns a different aspect of language:
    • one may focus on subject–verb relationships
    • another on modifiers
    • another on overall context
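A rough sketch of the multi-head idea: split each vector into slices, run attention on each slice separately, then join the results. Real implementations give every head its own learned projections and add a final output projection; both are left out here, and the input values are hypothetical.

```python
import math

def attention(vectors: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(len(query))])
    return outputs

def multi_head_attention(vectors: list[list[float]], num_heads: int) -> list[list[float]]:
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    dim = len(vectors[0])
    assert dim % num_heads == 0, "dimension must divide evenly across heads"
    head_dim = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        start = h * head_dim
        head_slice = [v[start:start + head_dim] for v in vectors]
        head_outputs.append(attention(head_slice))
    # Concatenate the per-head results back into full-width vectors.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(vectors))]

tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
result = multi_head_attention(tokens, num_heads=2)
print(len(result), len(result[0]))  # 3 4
```

Since each head only sees its own slice, the heads can end up with different attention patterns over the same sentence.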





    6. Feed-Forward Network

    After attention, the representation goes into a feed-forward network.


    What happens here?
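    • The feed-forward network refines each token's representation independently of the other tokens.
    • After the final layer, a linear output layer assigns a score to every token in the model’s vocabulary, which is how the model decides what word should come next.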
    • If the vocabulary contains 50,000 tokens, the output is a list of 50,000 scores.
    • These scores are called logits.



    Example:

    For sentence: "The cat is ..."
    Logits →
    [2.3, 4.97, 84.21, -5.65, ...]

    where:
    - “sleeping” → very high score
    - “running” → medium score
    - “apple” → very low score






    At this stage:
    • These are raw scores
    • They are not probabilities
    • Higher score = more likely next word
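A sketch of how one vector becomes one score per vocabulary word. In real models this scoring is a linear projection applied to the final layer's output; the four-word vocabulary, the weights, and the hidden state below are all made up for illustration.

```python
# Hypothetical 4-word vocabulary and 3-dimensional hidden state.
VOCAB = ["sleeping", "running", "apple", "the"]

# Made-up projection weights: one row of weights per vocabulary entry.
OUTPUT_WEIGHTS = [
    [2.0, 1.0, 0.5],    # sleeping
    [1.0, 0.5, 0.0],    # running
    [-1.0, -2.0, 0.0],  # apple
    [0.0, 0.0, 0.1],    # the
]

def to_logits(hidden: list[float]) -> list[float]:
    """One raw score per vocabulary entry: dot product of hidden state and its row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in OUTPUT_WEIGHTS]

hidden_state = [1.0, 1.0, 1.0]  # stand-in for the model's final vector after "The cat is"
logits = to_logits(hidden_state)
print(dict(zip(VOCAB, logits)))
```

With a 50,000-token vocabulary the weight table would simply have 50,000 rows, giving 50,000 logits.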


    7. Softmax Output


    The logits are passed through a softmax function.

    Softmax:
    • converts scores into probabilities (0 → 1)
    • ensures they add up to 1


    Now the model has a probability distribution over all possible next words.

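    In the simplest strategy (greedy decoding), the word with the highest probability is selected. In practice, many systems sample from this distribution instead, which adds variety to the output.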
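A small sketch of softmax followed by picking the most likely word. The mini-vocabulary and logit values are hypothetical.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["sleeping", "running", "apple"]  # hypothetical mini-vocabulary
logits = [4.0, 2.0, -3.0]                 # made-up scores for "The cat is ..."

probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])  # take the top-scoring index
print(vocab[best])  # sleeping
```

Softmax keeps the ranking of the logits, so the highest score always becomes the highest probability.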



    🔄 Putting It All Together: Encoder → Decoder Flow




    The original Transformer is split into two major parts:
    • Encoder (Left side in the above image)
    • Decoder (Right side in the above image)


    Let’s walk through them using an example.






    Example Prompt:
    "Write a short story about dragon"







    🔐 Encoder Flow

    1. Prompt → Tokens
    2. Tokens → Vector Embeddings
    3. Embeddings + Positional Encoding
    4. Multi-Head Self-Attention
    5. Feed-Forward Network


    The encoder produces a rich contextual representation.


    It learns things like:
    • “story” relates to “dragon”
    • “short” modifies “story”
    • overall intent of the prompt


    This output is not text — it’s meaning.





    🎯 Decoder Flow (Word by Word Generation)

    The decoder generates text one word at a time.


    Step 1: Start Token

    Initially, the decoder receives only a special start token.


    From there it must predict the first word. During training, the model learned patterns from prompts like:
    • “Write a story about…”
    • “Tell a story about…”


    Many stories statistically start with:






    "Once upon a time"







    So the model predicts:






    Once







    The same process repeats for the next word, producing:






    Once upon







    Step 2: Masked Self-Attention

    Masked self-attention ensures the model cannot see future words.


    It allows:
    • “Once” to look at the start token
    • “upon” to look at both the start token and “Once”
    • but “Once” cannot attend to later tokens like “upon”, even when they are already part of the input (as they are during training)
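The masking rule can be sketched as a lower-triangular table of allowed lookups. The token name `<start>` is a placeholder I chose for the start token, and note that in this sketch each position is also allowed to attend to itself, as is usual in causal attention.

```python
def causal_mask(length: int) -> list[list[bool]]:
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]

tokens = ["<start>", "Once", "upon"]  # "<start>" is a placeholder name
mask = causal_mask(len(tokens))
for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can see {visible}")
```

In a real model, the disallowed positions have their attention scores set to a very large negative value before softmax, so their weights come out as effectively zero.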


    Step 3: Cross-Attention

    Masked self-attention only looks at generated words.

    But the model also needs to remember:
    • what the user asked for
    • what the prompt means


    Why does cross-attention exist?


    Cross-attention allows the decoder to:
    • look at the encoder’s output
    • align generated words with the prompt’s meaning


    For example, the encoder representation contains:
    • “story”
    • “dragon”


    So when generating words, the decoder is reminded:
    • this is a story
    • it must involve a dragon
    • tone should match the prompt


    Without cross-attention:
    • the model could drift off-topic
    • or generate generic text unrelated to the prompt
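A simplified sketch of cross-attention: the queries come from the decoder, while the vectors being looked at come from the encoder. Learned projections are omitted, and all vector values are hypothetical.

```python
import math

def cross_attention(decoder_vecs: list[list[float]],
                    encoder_vecs: list[list[float]]) -> list[list[float]]:
    """Each decoder vector (query) takes a weighted average of the encoder vectors."""
    scale = math.sqrt(len(encoder_vecs[0]))
    outputs = []
    for query in decoder_vecs:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in encoder_vecs]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, encoder_vecs))
                        for d in range(len(encoder_vecs[0]))])
    return outputs

encoder_output = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical vectors for "story", "dragon"
decoder_state = [[0.2, 0.9]]               # hypothetical vector for the word being generated
context = cross_attention(decoder_state, encoder_output)
print([round(x, 2) for x in context[0]])
```

The output has one vector per decoder position, each built entirely from encoder information, which is how the prompt keeps influencing every generated word.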

    Step 4: Predict Next Word

    At this stage, the decoder predicts the next word in three clear steps:


    1. Output Projection (Logits Generation)

    Based on the prompt and previously generated words, a final linear layer (applied after the feed-forward network) assigns a score to every word in the vocabulary.


    2. Softmax (Probability Distribution)

    The logits are passed through a softmax function, converting them into probabilities between 0 and 1, where all values sum to 1.


    3. Token Selection

    In the simplest case (greedy decoding), the word with the highest probability is chosen as the next token.






    Example:
    Once upon
    → next token: "a"



    The decoder input now becomes:



    Once upon a







    This loop repeats token by token until the output is complete.
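The loop itself can be sketched with a stand-in for the model. Here `TOY_NEXT_TOKEN` is a hypothetical lookup table that simply continues the “Once upon a time” opening mentioned earlier; a real model would run the full forward pass at each step instead, and `<end>` is a placeholder name for the end marker.

```python
# Hypothetical stand-in for the model: maps the text so far to the next token.
TOY_NEXT_TOKEN = {
    "": "Once",
    "Once": "upon",
    "Once upon": "a",
    "Once upon a": "time",
    "Once upon a time": "<end>",
}

def predict_next(generated: list[str]) -> str:
    """Pretend forward pass: look up the next token for the current sequence."""
    return TOY_NEXT_TOKEN.get(" ".join(generated), "<end>")

def generate(max_tokens: int = 10) -> list[str]:
    generated: list[str] = []
    for _ in range(max_tokens):
        token = predict_next(generated)
        if token == "<end>":     # stop when the end marker is produced
            break
        generated.append(token)  # the new token becomes part of the next input
    return generated

print(" ".join(generate()))  # Once upon a time
```

The `max_tokens` limit is there because a real model is not guaranteed to produce an end marker on its own.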





    📝 Note on Modern LLMs

    The original Transformer architecture includes both an encoder and a decoder.


    However, many modern large language models (like GPT models) use a decoder-only architecture.


    In these models:
    • The prompt is treated as part of the input sequence
    • The model uses masked self-attention
    • There is no separate encoder block


    Despite this difference, the core idea — self-attention — remains the foundation.





    🌱 Final Takeaway

    LLMs don’t “understand” language like humans.


    They:
    • learn patterns
    • assign probabilities
    • repeat this process once for every generated token, often hundreds or thousands of times per response


    But the Transformer architecture makes this process powerful by allowing:
    • global context
    • parallel processing
    • deep relationships between words


    Seeing how fast LLM apps like ChatGPT respond,

    I never imagined such a large, iterative process was running underneath.


    Once you understand this flow, LLMs stop feeling magical — and start feeling engineered.



