Understanding Foundation Models

A guide to how training data, architecture, size, post-training, and sampling shape foundation model behavior.

To build applications with foundation models, you first need foundation models. While you don't need to know how to develop a model to use it, a high-level understanding will help you decide what model to use and how to adapt it to your needs.

What This Chapter Can and Can't Do

Training a foundation model is an incredibly complex and costly process. Those who know how to do this well are likely prevented by confidentiality agreements from disclosing the secret sauce.

This chapter won't be able to tell you how to build a model to compete with ChatGPT. Instead, it'll focus on design decisions with consequential impact on downstream applications.

Because the training process of foundation models is increasingly opaque, it's difficult to know all the design decisions that go into making a model. In general, however, differences between foundation models can be traced back to decisions about training data, model architecture and size, and how the models are post-trained to align with human preferences.

The Decisions That Shape a Model

Training Data

Since models learn from data, their training data reveals a great deal about their capabilities and limitations. This chapter begins with how model developers curate training data, focusing on the distribution of training data.

Architecture

Given the dominance of the transformer architecture, it might seem that model architecture is less of a choice. You might be wondering what makes the transformer architecture so special that it continues to dominate.

Size

Whenever a new model is released, one of the first things people want to know is its size. This chapter will explore how a model developer might determine the appropriate size for their model.

Post-Training

Pre-training makes a model capable, but not necessarily safe or easy to use. Post-training aligns the model with human preferences, which has a significant impact on the model's usability.

On the training data side, Chapter 8 explores dataset engineering techniques in detail, including data quality evaluation and data synthesis.

From Capability to Usability

As mentioned in Chapter 1, a model's training process is often divided into pre-training and post-training.

Pre-Training

Pre-training makes a model capable, but not necessarily safe or easy to use.

Post-Training

Post-training aims to align the model with human preferences.

But what exactly is human preference? How can it be represented in a way that a model can learn? The way a model developer aligns their model has a significant impact on the model's usability, and will be discussed in this chapter.

The Underrated Role of Sampling

While most people understand the impact of training on a model's performance, the impact of sampling is often overlooked. Sampling is how a model chooses an output from all possible options. It is perhaps one of the most underrated concepts in AI.

Sampling explains many seemingly baffling AI behaviors, including hallucinations and inconsistencies. Choosing the right sampling strategy can also significantly boost a model's performance with relatively little effort.
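To make "choosing an output from all possible options" concrete, here is a minimal sketch of sampling a next token from a toy probability distribution. The vocabulary, the scores, and the `temperature` parameter are illustrative assumptions rather than anything defined in this section; real models work over vocabularies of many thousands of tokens. The sketch is only meant to show why the same input can yield different outputs from run to run, while always picking the top option (greedy decoding) cannot.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities.

    Assumption for illustration: dividing by a temperature below 1
    sharpens the distribution, above 1 flattens it.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(vocab, logits):
    """Always pick the highest-scoring option (deterministic)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


def sample_token(vocab, logits, temperature=1.0, rng=random):
    """Pick an option at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical candidates for the next token and made-up scores.
vocab = ["Paris", "Lyon", "London", "banana"]
logits = [4.0, 2.0, 1.5, -1.0]

rng = random.Random(0)  # seeded only so repeated runs are comparable
samples = [sample_token(vocab, logits, temperature=1.0, rng=rng) for _ in range(10)]

print("greedy: ", greedy_token(vocab, logits))
print("sampled:", samples)
```

Because each draw is random, unlikely options such as "banana" in this toy setup can occasionally be selected, and two runs on the same input need not agree. Lowering the hypothetical `temperature` concentrates probability on the top option, which trades variety for consistency.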

For this reason, the sampling section is the one I was most excited to write in this chapter.

How to Use This Chapter

Concepts covered in this chapter are fundamental for understanding the rest of the book. However, because these concepts are fundamental, you might already be familiar with them.

Feel free to skip any concept that you're confident about. If you encounter a confusing concept later on, you can revisit this chapter.
