Research Problems in LLM Pretraining
What is pretraining
Pretraining = solve a single optimization problem (minimize next-token prediction loss) at unprecedented scale.
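As a minimal sketch of the objective named above, the next-token prediction loss is the average cross-entropy between the model's predicted distribution at each position and the token that actually follows. The function name `next_token_loss` and the list-of-lists logits layout are assumptions made for illustration. Real training code computes the same quantity on batched tensors.

```python
import math

def next_token_loss(logits, tokens):
    """Average cross-entropy of predicting tokens[t + 1] from logits[t].

    logits: one list of vocabulary scores per position (hypothetical layout).
    tokens: list of integer token ids, one longer than logits.
    """
    n = len(tokens) - 1
    total = 0.0
    for t in range(n):
        row = logits[t]
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[tokens[t + 1]]  # -log p(next token)
    return total / n

# A model with uniform scores over a 4-token vocabulary
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(next_token_loss(uniform, [0, 1, 2]))
```

With uniform scores the loss equals log(vocabulary size), which is the baseline that any trained model should beat.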
Design Space in Pretraining
High-dimensional search space, expensive per-trial cost. Interactions between areas make joint optimization intractable.
1. Scaling Laws
- Given compute budget C, what is the optimal allocation between model size N and data size D?
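- Power-law relationship: L(N,D) = E + A/N^α + B/D^β, where E is the irreducible loss and A, B, α, β are constants fitted to training runs (parametric form from Hoffmann et al. 2022)
- Ref: Kaplan et al. 2020 (Scaling Laws for Neural Language Models)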
- Ref: Hoffmann et al. 2022 (Chinchilla) — compute-optimal training
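The allocation question above has a closed-form answer once a compute model is fixed. The sketch below assumes the common approximation C ≈ 6·N·D training FLOPs, which is not stated in these notes, and minimizes L(N,D) under that constraint. The function names and the constants in the usage lines are illustrative placeholders, not fitted values from either paper.

```python
def loss(N, D, E, A, B, alpha, beta):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, A, B, alpha, beta):
    """Minimize L(N, D) subject to 6 * N * D = C (assumed FLOP model).

    Setting dL/dN = 0 with D = K / N, K = C / 6 gives
    N* = (alpha * A / (beta * B))^(1 / (alpha + beta)) * K^(beta / (alpha + beta)).
    """
    K = C / 6.0
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N = G * K ** (beta / (alpha + beta))
    D = K / N
    return N, D

# Placeholder constants, chosen only to exercise the formula
E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28
N_opt, D_opt = compute_optimal(1e21, A, B, alpha, beta)
print(f"N = {N_opt:.3e} params, D = {D_opt:.3e} tokens")
print(f"L = {loss(N_opt, D_opt, E, A, B, alpha, beta):.4f}")
```

Under this form, N* grows as C^(β/(α+β)) and D* as C^(α/(α+β)), so the ratio of the two exponents decides whether extra compute should go mostly to parameters or to tokens.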
2. Model Architecture
- Dense vs. Mixture-of-Experts (MoE): sparsity, granularity, expert count, active params
- Attention variants: KV sharing, grouped-query attention, hybrid local/global
- Ref: Clark et al. 2022 — Unified scaling laws for routed language models
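A rough sketch of what these architecture knobs control: top-k routing picks which experts process a token, which separates total from active parameters, and KV sharing shrinks the attention cache. All function names here are hypothetical, and the accounting assumes each expert is a two-matrix feed-forward block without biases and a cache that stores one key and one value vector per KV head per layer. Real models differ in these details.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token and
    renormalize their scores with a softmax over the selection."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_ffn_params(d_model, d_ff, n_experts, k):
    """Total vs. per-token active parameters of one MoE feed-forward layer."""
    per_expert = 2 * d_model * d_ff   # up- and down-projection, no biases
    router = d_model * n_experts      # linear router
    total = n_experts * per_expert + router
    active = k * per_expert + router
    return total, active

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values cached per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))
print(moe_ffn_params(d_model=4, d_ff=8, n_experts=4, k=2))
print(kv_cache_bytes_per_token(n_layers=2, n_kv_heads=8, head_dim=4))
print(kv_cache_bytes_per_token(n_layers=2, n_kv_heads=2, head_dim=4))
```

In this accounting, going from 8 KV heads to 2 (grouped-query attention with 4 query heads per group) cuts the cache by 4x, while raising the expert count at fixed k grows total parameters without changing per-token compute.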
3. Data Selection & Mixture
- Which data, in what proportions, in what order?
- Ref: Xie et al. 2023 (DoReMi) — domain reweighting with minimax optimization
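To make the reweighting idea concrete, here is a simplified sketch in the spirit of DoReMi, not the authors' implementation. Domains where a small proxy model does worse than a reference model get upweighted through a multiplicative (exponentiated-gradient) update, followed by mixing with the uniform distribution. The function name, the step size, and the smoothing value are assumptions. The actual method interleaves this update with proxy-model training and averages the weights over steps.

```python
import math

def reweight_domains(weights, proxy_losses, reference_losses,
                     step_size=1.0, smoothing=1e-3):
    """One multiplicative update of the domain mixture weights.

    excess loss = max(proxy loss - reference loss, 0) per domain;
    domains with larger excess loss receive more weight.
    """
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, reference_losses)]
    scaled = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    k = len(weights)
    # Mix with the uniform distribution so no domain's weight collapses to zero
    return [(1.0 - smoothing) * s / total + smoothing / k for s in scaled]

# Three hypothetical domains, starting from a uniform mixture
start = [1 / 3, 1 / 3, 1 / 3]
new = reweight_domains(start, proxy_losses=[2.5, 2.0, 1.8],
                       reference_losses=[2.0, 2.0, 2.0])
print([round(w, 4) for w in new])
```

Only the first domain has positive excess loss here, so it is the only one whose weight rises. The other two shrink equally after normalization.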