Understanding Foundation Models

Summary

A recap of how training data, modeling choices, post-training, and sampling shape foundation model behavior.

Chapter Summary

This chapter discussed the core design decisions when building a foundation model. Since most people will use ready-made foundation models instead of training one from scratch, it skipped the nitty-gritty training details in favor of modeling factors that help you determine what models to use and how to use them.

What This Chapter Covered

Training Data

A crucial factor affecting a model's performance is its training data. Large models require a large amount of training data, which can be expensive and time-consuming to acquire. Model providers, therefore, often leverage whatever data is available.

Languages and Domains

Training on whatever data is available yields models that perform well on the tasks well represented in that data, which may not include the specific task you want. The chapter covered why it's often necessary to curate training data for specific languages, especially low-resource languages, and for specific domains.

Model Architecture

After sourcing the data, model development can begin. Before training, an important step is architecting the model. The chapter looked into modeling choices, including architecture and size.

Transformer Models

The dominant architecture for language-based foundation models is the transformer. The chapter explored the problems the transformer architecture was designed to address, as well as its limitations.

Scale and Compute

The scale of a model can be measured by three key numbers:

Parameters

The number of parameters is one signal of a model's size and capacity.

Training Tokens

The number of training tokens reflects how much data the model is trained on.

FLOPs

The number of FLOPs needed for training is a proxy for training compute cost.

The amount of compute needed to train a model depends on two things: the model size and the data size. Given a compute budget, a scaling law helps determine the compute-optimal number of parameters and number of training tokens.
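To make the relationship between the three numbers concrete, here is a minimal sketch of how a compute budget could be split between parameters and tokens. It rests on two assumptions that this summary does not state: the widely used approximation that training takes about 6 FLOPs per parameter per token, and a Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. The function names are hypothetical, and the numbers are for illustration only.

```python
import math

# Assumption: training compute C is approximated as 6 * N * D FLOPs,
# where N is the parameter count and D is the number of training tokens.
FLOPS_PER_PARAM_PER_TOKEN = 6

# Assumption: a Chinchilla-style rule of thumb of ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute for a given model size and data size."""
    return FLOPS_PER_PARAM_PER_TOKEN * num_params * num_tokens


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Split a FLOP budget into (num_params, num_tokens) under the assumptions above.

    Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N^2,
    so N = sqrt(C / 120).
    """
    if flops_budget <= 0:
        raise ValueError("flops_budget must be positive")
    num_params = math.sqrt(
        flops_budget / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM)
    )
    num_tokens = TOKENS_PER_PARAM * num_params
    return num_params, num_tokens


# Hypothetical budget, chosen only to show the call.
params, tokens = compute_optimal_split(1e23)
print(f"params ~ {params:.3g}, tokens ~ {tokens:.3g}")
```

Under these assumptions, parameters and tokens both grow with the square root of the budget, so quadrupling the compute roughly doubles each. Real scaling laws are fitted empirically, and their constants vary with the data and architecture.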

This chapter also looked at scaling bottlenecks. Currently, scaling up a model generally makes it better. But how long will this continue to be true?

From Pre-Training to User Value

Because pre-training relies on self-supervision over training data that is often of low quality, the resulting model might produce outputs that don't align with what users want.

Supervised Finetuning

Post-training starts by teaching the model to better follow instructions and conduct useful conversations.

Preference Finetuning

Preference finetuning further steers the model toward responses that align with human preference. Human preference is diverse and impossible to capture in a single mathematical formula, so existing solutions are far from foolproof.
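Because preference is hard to write down directly, one common workaround is to learn it from pairwise comparisons: given two responses, which one did a person prefer? The sketch below shows a Bradley-Terry style pairwise loss of the kind often used to train a reward model. It is an illustration under my own assumptions, not a description of any specific provider's method. The function name and the scalar reward scores are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one comparison: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the preferred response is scored well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)). Branching on the sign of the
    # margin keeps the exponent non-positive, which avoids overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores a reward model might assign to two responses.
print(pairwise_preference_loss(reward_chosen=1.5, reward_rejected=0.2))
```

A loss like this only captures the preferences present in the comparison data, which is one reason such approaches remain far from foolproof.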

Sampling and Probabilistic Behavior

This chapter also covered one of my favorite topics: sampling, the process by which a model generates output tokens.

Sampling

Sampling makes AI models probabilistic.
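As a minimal sketch of where this randomness comes from, the code below turns a list of logits into probabilities with a temperature-scaled softmax and then draws one token at random. The logits and function names are made up for illustration. Real inference stacks layer further strategies on top, and nothing here reflects a particular model's implementation.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index at random, weighted by the softmax probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for a three-token vocabulary.
example_logits = [2.0, 1.0, 0.1]
rng = random.Random(0)  # seeded only so that a rerun is repeatable
draws = [sample_token(example_logits, temperature=1.0, rng=rng) for _ in range(10)]
print(draws)
```

With the same logits, repeated calls can return different tokens, which is the probabilistic behavior described above. Lowering the temperature concentrates probability on the highest-scoring token and makes outputs more consistent, at the cost of variety.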

Creative Strength

This probabilistic nature is what makes models like ChatGPT and Gemini great for creative tasks and fun to talk to.

Reliability Challenge

The same probabilistic nature also causes inconsistency and hallucinations.

Toward Systematic AI Engineering

Working with AI models requires building your workflows around their probabilistic nature. The rest of this book will explore how to make AI engineering, if not deterministic, at least systematic.

The first step toward systematic AI engineering is to establish a solid evaluation pipeline to help detect failures and unexpected changes. Evaluation for foundation models is so crucial that I dedicated two chapters to it, starting with the next chapter.
Copyright © 2026