Course Tracks and Artifacts

This course is designed to be flexible across a wide range of system capabilities. The one hard constraint is that most of the exercises and libraries will only work on an M-Series Mac laptop or desktop.

Everything runs on one conceptual path, but there are several local-compute tracks, depending on your machine's CPU/GPU, memory, and storage. Students are encouraged to experiment with differently sized models, training runs, and datasets. However, the course loaders choose sensible defaults designed to produce good results with relatively modest resources.

System Probe

The repo includes a system probe script that assesses which model sizes best fit your local machine:

./sysprobe.sh

Use its output to help decide on model and dataset sizes. If it clears your system for the standard track, you can most likely use the defaults without issue.

What You Decide

There are two reasons to deviate from the course defaults. One is that your system lacks the resources for the standard track choices. The other is that you want to experiment with different or more powerful models, which is highly encouraged.

You make four independent choices, one per row in the table below. The conceptual path is identical across tracks, and any choice you defer falls back to the course default:

| Decision | Command | Tiny | Standard | Stretch |
| --- | --- | --- | --- | --- |
| Dataset footprint | ./datasets.sh | --tiny | --small | (no flag) |
| TinyLLM (Modules 10-12) | Module 10 notebook | ShakespeareLM-1M, StoryLM-5M | StoryLM-30M, TinyLLM-30M | TinyLLM-100M |
| BaseLM (Modules 13-16) | ./baselm.sh | 125M base | 350M base | 600M base |
| ProdLM (Modules 16-20) | ./prodlm.sh | 1.5B-3B instruct | 7B-8B instruct | 14B-class instruct |

Recommendations for each track:

| Track | Target | Chip | Memory | Downloads | Free disk |
| --- | --- | --- | --- | --- | --- |
| Tiny | Fastest path, weaker hardware, quick Module 10 artifact | Any Apple chip | 8GB | about 100MB | 5-10GB |
| Standard | Recommended local course experience | M2 or better | 24GB | several GB | 20-40GB |
| Stretch | Stretch path for stronger machines and longer runs | M3 or better (16+ GPU cores) | 64GB | 10GB+ | 40-80GB |
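
The memory and free-disk columns above can be read as a simple lookup. The sketch below is only an illustration of that reading, not how sysprobe.sh works: the `Track` class and `recommend_track` function are hypothetical names, the chip column is ignored, and the low end of each free-disk range is used as the minimum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    name: str
    min_memory_gb: int
    min_free_disk_gb: int


# Thresholds copied from the table above, ordered from most to least demanding.
# Assumption: the low end of each "Free disk" range is treated as the minimum.
TRACKS = [
    Track("Stretch", 64, 40),
    Track("Standard", 24, 20),
    Track("Tiny", 8, 5),
]


def recommend_track(memory_gb, free_disk_gb):
    """Return the most demanding track whose minimums are both met, or None."""
    for track in TRACKS:
        if memory_gb >= track.min_memory_gb and free_disk_gb >= track.min_free_disk_gb:
            return track.name
    return None


if __name__ == "__main__":
    # A 24GB machine with plenty of free disk lands on the Standard track.
    print(recommend_track(memory_gb=24, free_disk_gb=100))
    # A 64GB machine with only 30GB free is limited by disk, not memory.
    print(recommend_track(memory_gb=64, free_disk_gb=30))
```

Treat the result as a starting point only. The probe script remains the better guide because it also considers the chip and specific candidate models.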

Datasets

Datasets are loaded with:

./datasets.sh --tiny    # Tiny dataset track, a few hundred MB
./datasets.sh --small   # Standard dataset track (small G2C corpus), several GB
./datasets.sh           # Full dataset track (Stretch), around 10GB

All ./datasets.sh commands are intended to be idempotent. Rerunning should skip previously completed downloads and derived artifacts.
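
One common way to get this kind of idempotence is to check for the finished artifact before doing any work, and to write to a temporary name so an interrupted run never leaves a half-written file that looks complete. The sketch below illustrates that pattern under those assumptions. It is not taken from datasets.sh, and `ensure_artifact` and the `.partial` suffix are hypothetical.

```python
import tempfile
from pathlib import Path


def ensure_artifact(target: Path, build) -> bool:
    """Call build(tmp_path) only if target is missing.

    Returns True if work was done, False if the step was skipped.
    """
    if target.exists():
        return False  # already completed on an earlier run
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".partial")
    build(tmp)           # write to a temporary name first
    tmp.replace(target)  # rename into place only after the build succeeds
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "data" / "sample.txt"

        def build(path: Path) -> None:
            # Stand-in for a real download or tokenization step.
            path.write_text("placeholder contents\n")

        print(ensure_artifact(target, build))  # first run does the work
        print(ensure_artifact(target, build))  # rerun skips it
```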

| Track | Raw data | Tokenizers |
| --- | --- | --- |
| Tiny | TinyStories 100MB sample | StoryTokenizer |
| Standard | GloVe, full TinyStories, small G2C Corpus v1 | StoryTokenizer, G2CTokenizer |
| Full | GloVe, full TinyStories, full G2C Corpus v1 | StoryTokenizer, G2CTokenizer |

The normal ./setup.sh path handles the smallest datasets. It prepares the Python environment, TinyShakespeare, and the ShakespeareTokenizer artifact.

External models

BaseLM and ProdLM are prepared separately because they are model artifacts, not dataset tracks:

./baselm.sh [<hf-model-id>]    # Hugging Face model ID, optional
./prodlm.sh [<ollama-tag>]     # Ollama tag, optional

Both scripts have sensible defaults if you omit the optional model tag. Pick any small base LM for BaseLM and any local instruct model for ProdLM that fits your machine. You can use sysprobe.sh to get a list of suggestions, or use any valid Hugging Face or Ollama tags:

./sysprobe.sh                                        # Run sysprobe checks against the suggested candidate set
./sysprobe.sh --baselm-model Qwen/Qwen3-0.6B-Base    # Evaluate fit for BaseLM
./sysprobe.sh --prodlm-model qwen3.5:9b              # Evaluate fit for ProdLM

You can use the scripts to download multiple models. The scripts are idempotent and cache their results, so you can run them repeatedly. The last tag a script is run with becomes the canonical BaseLM or ProdLM model:

./prodlm.sh llama3.2:3b    # Downloads llama, sets it as ProdLM
./prodlm.sh qwen2.5:3b     # Downloads qwen, ProdLM now points to qwen
./prodlm.sh llama3.2:3b    # Qwen still cached, ProdLM now points to llama
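
The behavior in the three commands above amounts to a cache plus a "current" pointer. The toy model below mirrors that sequence in memory. It is only a sketch of the described semantics, with a hypothetical `ModelSelection` class and a counter standing in for real downloads. It does not reflect how prodlm.sh stores its state.

```python
class ModelSelection:
    """Toy model of the semantics: cache every tag seen, point at the last one."""

    def __init__(self):
        self.cached = []     # tags already downloaded, in first-seen order
        self.current = None  # tag the canonical model points to
        self.downloads = 0   # counts simulated downloads

    def select(self, tag):
        if tag not in self.cached:
            self.downloads += 1  # stand-in for an actual download
            self.cached.append(tag)
        self.current = tag       # the last tag always wins
        return self.current


if __name__ == "__main__":
    prodlm = ModelSelection()
    for tag in ["llama3.2:3b", "qwen2.5:3b", "llama3.2:3b"]:
        prodlm.select(tag)
    # Two downloads in total; the pointer is back on the first tag.
    print(prodlm.downloads, prodlm.cached, prodlm.current)
```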

The notebooks all allow you to specify any valid Hugging Face (BaseLM) or Ollama (ProdLM) model that you have previously downloaded. Re-running the same notebook with different models is a very good learning experience.

Note that ProdLM and BaseLM have different system requirements for equivalently sized models. BaseLM is trained locally, whereas ProdLM is only used for Ollama inference. The practical size ceiling for BaseLM is therefore substantially lower than for ProdLM. For example, the course defaults use a 360M-parameter model for BaseLM and a 2B model for ProdLM.
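
A rough back-of-the-envelope calculation shows why training lowers the ceiling. The sketch below rests on assumptions that are not stated in the course material: full fine-tuning in fp32 with the Adam optimizer (about 16 bytes per parameter for weights, gradients, and two optimizer moments) and 4-bit quantized weights for inference. Activations, KV cache, and framework overhead are ignored, so real usage will be higher. The function names are hypothetical.

```python
GB = 1e9


def training_memory_gb(params, bytes_per_param=16):
    """Weights (4) + gradients (4) + Adam moments (8) in fp32; activations excluded."""
    return params * bytes_per_param / GB


def inference_memory_gb(params, bits_per_weight=4):
    """Quantized weights only; KV cache and runtime overhead excluded."""
    return params * bits_per_weight / 8 / GB


if __name__ == "__main__":
    base = training_memory_gb(360e6)   # the 360M BaseLM default, trained locally
    prod = inference_memory_gb(2e9)    # the 2B ProdLM default, inference only
    print(f"BaseLM training estimate:  {base:.2f} GB")
    print(f"ProdLM inference estimate: {prod:.2f} GB")
```

Under these assumptions the smaller model needs several times more memory to train than the larger one needs to serve, which is consistent with the different defaults.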

Artifact Roles

| Role | Meaning |
| --- | --- |
| ShakespeareLM | Tiny baseline model. Useful for proving the loop works, not for quality. |
| StoryLM | TinyStories-trained model. More coherent stories, still not a general assistant. |
| TinyLLM | Broader G2C-corpus model trained from scratch. Best self-trained candidate for assistant-shaped experiments. |
| BaseLM | Small external pretrained base model for Modules 13-16. |
| ProdLM | Local pretrained instruct model for Modules 16-20. Served on Ollama. |
| <base>-SFT | Derived SFT artifact produced in Module 13. The base is whichever model was fine-tuned. |
| <base>-DPO | Derived DPO model produced in Module 14. The base is whichever model was fine-tuned. DPO artifacts are typically layered on top of the corresponding -SFT artifact. |

Time Costs

Exact times vary with Mac generation, RAM, network, and whether MPS is available. Use these as planning ranges, not promises.

| Step | Typical track | Time shape | Cached output |
| --- | --- | --- | --- |
| ./setup.sh | all | minutes | .venv/, TinyShakespeare, ShakespeareTokenizer |
| GloVe download/extract | Standard/Full | minutes, network-bound | data/embeddings/glove.6B.50d.txt |
| TinyStories sample | Tiny | minutes, network-bound | compressed 100MB shards |
| Full TinyStories | Standard/Full | several minutes to tens of minutes | compressed 100MB shards |
| G2C corpus small | Standard | roughly tens of minutes | data/datasets/g2c-corpus-v1-small/ |
| G2C corpus full | Full | long: network plus processing | data/datasets/g2c-corpus-v1/ |
| StoryTokenizer | Tiny/Standard/Full | minutes | artifacts/tokenizers/StoryTokenizer/ |
| G2CTokenizer | Standard/Full | minutes to tens of minutes | artifacts/tokenizers/G2CTokenizer/ |
| Tokenized TinyStories | Tiny/Standard/Full | minutes | data/cache/token-corpus/StoryLM-* |
| Tokenized G2C corpus | Standard/Full | minutes to tens of minutes | data/cache/token-corpus/TinyLLM-* |
| StoryLM 5M/30M training | Module 10 | minutes to about an hour | checkpoints and model artifacts |
| TinyLLM 30M-100M training | Module 10/12 | multi-hour or overnight | checkpoints and model artifacts |
| BaseLM fetch | Modules 13-16 | model-size and network dependent | ./baselm.sh, HF cache under data/cache/baselm/ |
| ProdLM fetch | Modules 16-20 | model-size and network dependent | ./prodlm.sh, external Ollama model cache |

Downloads and tokenized corpora are one-time setup costs. Training runs are the recurring cost. Long training runs are set up to checkpoint, so you can interrupt, inspect, sample, and continue.
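
The interrupt-and-continue behavior can be pictured with a minimal loop that saves its state periodically and reloads it on start. This is a generic sketch, not the course's training code: the `train` function, the JSON state file, and the every-10-steps interval are all assumptions, and the arithmetic update stands in for a real optimizer step.

```python
import json
import tempfile
from pathlib import Path


def train(total_steps, ckpt_path: Path, stop_after=None, every=10):
    """Run up to total_steps, resuming from ckpt_path if it exists.

    stop_after simulates an interruption after that many steps in this run.
    """
    state = {"step": 0, "loss_sum": 0.0}
    if ckpt_path.exists():
        state = json.loads(ckpt_path.read_text())  # resume from the last save

    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            break  # simulated interruption
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for a training step
        steps_this_run += 1

        if state["step"] % every == 0 or state["step"] == total_steps:
            # Write to a temporary file, then rename, so a crash mid-write
            # cannot corrupt the previous checkpoint.
            tmp = ckpt_path.with_name(ckpt_path.name + ".tmp")
            tmp.write_text(json.dumps(state))
            tmp.replace(ckpt_path)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = Path(workdir) / "ckpt.json"
        train(100, ckpt, stop_after=25)  # interrupted; last save was at step 20
        final = train(100, ckpt)         # resumes from step 20 and finishes
        print(final["step"])
```

Work done after the last save is repeated on resume, which is the usual trade-off when choosing a checkpoint interval.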

Module Expectations

| Modules | Requirement pattern |
| --- | --- |
| 01-03B | Normal setup only. |
| 04 | Conceptual tokenizer exercises run small; artifact mini-milestone uses the dataset track you prepared. |
| 05 | Optional GloVe download for pretrained vector exercises. |
| 06-09B | Mostly lightweight. Use TinyShakespeare and small synthetic corpora. |
| 10 | Main artifact fork: ShakespeareLM baseline, StoryLM, and optional TinyLLM. |
| 11 | Uses the strongest saved Module 10 model it can find. |
| 12 | Scaling lab. Extends Module 10 and can stay small or go stretch. |
| 13-15 | Prefer a capable self-trained TinyLLM when available; otherwise run ./baselm.sh and use BaseLM. |
| 16-18 | Use ProdLM for the main assistant path. The strongest self-trained artifact can also be loaded for comparison. |
| 19-20 | ProdLM only. The deterministic exercises use a FakeBackend; live cells require ProdLM. Self-trained models are not currently exposed here because they do not reliably follow ReAct or multi-turn assistant formats. |
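
The "strongest available, else fall back" rules for Modules 11 and 13-15 can be expressed as a preference-ordered lookup. The sketch below is an illustration only: `pick_model` is a hypothetical helper, the notebooks' actual discovery logic is not shown in this document, and the strength ordering is inferred from the Artifact Roles descriptions.

```python
def pick_model(saved_roles, preference, fallback=None):
    """Return the first role in preference that has a saved artifact, else fallback."""
    for role in preference:
        if role in saved_roles:
            return role
    return fallback


# Assumed ordering, strongest first, inferred from the Artifact Roles table.
MODULE_10_STRENGTH = ["TinyLLM", "StoryLM", "ShakespeareLM"]

if __name__ == "__main__":
    saved = {"ShakespeareLM", "StoryLM"}

    # Module 11: strongest saved Module 10 model.
    print(pick_model(saved, MODULE_10_STRENGTH))

    # Modules 13-15: prefer a self-trained TinyLLM, otherwise use BaseLM.
    print(pick_model(saved, ["TinyLLM"], fallback="BaseLM"))
```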