encoder-only (BERT-style):
looks at the full sentence at once, masks random tokens, and tries to reconstruct them.
this is masked language modeling (MLM), and it gives the model bidirectional context: it attends to tokens both left and right of the [MASK] when reconstructing it.
this makes it better suited to understanding tasks like sentence classification than to text generation.
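the masking step above can be sketched in plain python. this is a toy illustration, not the real implementation: actual BERT works on subword ids from a tokenizer, and the vocabulary here is made up. it does follow BERT's published corruption recipe: pick ~15% of tokens, and of those replace 80% with [MASK], 10% with a random token, and leave 10% unchanged.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "dog", "sat", "on", "mat"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """corrupt ~mask_prob of tokens BERT-style.
    returns (corrupted tokens, labels) where labels[i] is the
    original token if position i was selected, else None
    (no loss is computed at unselected positions)."""
    rng = random.Random(seed)
    out, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # model must reconstruct this
            r = rng.random()
            if r < 0.8:
                out.append(MASK)        # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(rng.choice(VOCAB))  # 10%: random token
            else:
                out.append(tok)         # 10%: keep unchanged
        else:
            labels.append(None)
            out.append(tok)
    return out, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens)
print(corrupted)
print(labels)
```

during pretraining, the encoder sees `corrupted` and is trained to predict the original token at every position where `labels` is not None.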