Research Problems in LLM Pretraining

What is pretraining

Pretraining = solve a single optimization problem (minimize next-token prediction loss) at unprecedented scale.

[Figure: next-token prediction animation]
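
To make the objective concrete, here is a minimal sketch of the next-token prediction loss in plain Python. The function name `next_token_loss` and its argument layout (one list of logits per position) are assumptions for illustration; real training code would compute the same quantity on batched tensors.

```python
import math

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    targets = token_ids[1:]  # the label at position t is the token at t+1
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        # Numerically stable log-sum-exp over the vocabulary.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[target]  # equals -log softmax(logits)[target]
    return total / len(targets)

# Usage: a 4-token vocabulary with uninformative (uniform) logits.
tokens = [0, 2, 1]
uniform_logits = [[0.0] * 4, [0.0] * 4]
uniform_loss = next_token_loss(uniform_logits, tokens)
```

With uniform logits the loss equals log(vocabulary size), which is the natural baseline a model has to beat.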

Design Space in Pretraining

High-dimensional search space, expensive per-trial cost. Interactions between areas make joint optimization intractable.

[Figure: design space in pretraining]

1. Scaling Laws

Given compute budget C, what is the optimal allocation between model size N and data size D?
Parametric power-law fit (Hoffmann et al. 2022): L(N,D) = E + A/N^α + B/D^β, where E is the irreducible loss and A, B, α, β are fitted constants.
Ref: Kaplan et al. 2020 — Scaling Laws for Neural Language Models
Ref: Hoffmann et al. 2022 (Chinchilla) — compute-optimal training
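
The allocation question can be worked through in closed form under two assumptions that are not stated above: training compute is approximated as C ≈ 6ND, and the loss follows the parametric form L(N,D) = E + A/N^α + B/D^β. Substituting D = C/(6N) and setting dL/dN = 0 gives N_opt = (αA/(βB))^(1/(α+β)) · (C/6)^(β/(α+β)). The sketch below evaluates that expression; the constants are the fitted values reported by Hoffmann et al. 2022 and should be treated as illustrative, and the function names are hypothetical.

```python
# Fitted constants as reported by Hoffmann et al. 2022; illustrative only.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal(flops):
    """Minimize loss(N, D) subject to the assumed constraint flops = 6 * N * D."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (flops / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = flops / (6.0 * n_opt)  # spend the remaining budget on tokens
    return n_opt, d_opt

# Usage: optimal split for a 1e21 FLOP budget.
n_opt, d_opt = compute_optimal(1e21)
```

Because the exponent β/(α+β) is below 1, the optimal model size grows sublinearly with compute, and the rest of the budget goes to data.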

2. Model Architecture

Dense vs. Mixture-of-Experts (MoE): sparsity, granularity, expert count, active params
Attention variants: KV sharing, grouped-query attention, hybrid local/global
Ref: Clark et al. 2022 — Unified scaling laws for routed language models
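
The terms "active params" and "KV sharing" can be illustrated with simple counting. The sketch below assumes each expert is a two-matrix MLP without biases (gated variants use three matrices), a linear router, and a KV cache that stores one key and one value vector per KV head per layer. All function and parameter names are hypothetical.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active-per-token) parameter counts for one MoE FFN layer."""
    per_expert = 2 * d_model * d_ff   # up-projection plus down-projection
    router = d_model * n_experts      # linear router producing expert scores
    total = n_experts * per_expert + router
    active = top_k * per_expert + router  # only top_k experts run per token
    return total, active

def kv_cache_elems_per_token(n_layers, n_kv_heads, head_dim):
    """Number of cached elements per token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim

# Usage: 8 experts with top-2 routing, and 32 query heads sharing 8 KV heads.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
mha_cache = kv_cache_elems_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
gqa_cache = kv_cache_elems_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
```

In this configuration the MoE layer holds roughly four times as many parameters as it uses per token, and grouped-query attention shrinks the KV cache by the ratio of query heads to KV heads.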

3. Data Selection & Mixture

Which data, in what proportions, in what order?
Ref: Xie et al. 2023 (DoReMi) — domain reweighting with minimax optimization
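
As a rough illustration of domain reweighting, the sketch below applies one multiplicative-weights step that shifts sampling probability toward domains with higher excess loss. It is a simplified sketch in the spirit of DoReMi, not the authors' full procedure (which trains a proxy model against a reference model). The function name, `step_size`, and `smoothing` are assumptions for illustration.

```python
import math

def update_domain_weights(weights, excess_losses, step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on domain mixture weights.

    weights: current sampling probabilities per domain (all > 0, sum to 1).
    excess_losses: per-domain (proxy loss - reference loss); negatives are clipped to 0.
    """
    clipped = [max(x, 0.0) for x in excess_losses]
    scores = [math.log(w) + step_size * x for w, x in zip(weights, clipped)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    k = len(weights)
    # Mix with the uniform distribution so no domain's weight collapses to zero.
    return [(1.0 - smoothing) * e / norm + smoothing / k for e in exps]

# Usage: the first domain lags the reference model, so its weight increases.
new_weights = update_domain_weights([0.5, 0.5], [0.2, 0.0])
```

Repeating this step during proxy training and averaging the weights over steps is one way to arrive at a final mixture; the ordering question ("in what order?") is not addressed by this sketch.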