[R] Depth-first pruning transfers: GPT-2 → TinyLlama with stable gains and minimal loss
TL;DR:
Removing the right layers (instead of shrinking every layer's width) makes transformer models ~8–12% smaller with only ~6–8% quality loss (perplexity increase), and this now works across architectures (GPT-2 and TinyLlama) with near-zero variance across seeds.
I’ve been experimenting with depth-first pruning — removing entire layers based on sensitivity rather than shrinking model width.
I started on GPT-2 and have now validated it on TinyLlama 1.1B with full 3-seed replication.
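To make "removing layers based on sensitivity" concrete, here is a minimal toy sketch (hypothetical names, not the author's code): each layer's sensitivity is the quality drop when that single layer is skipped, and the least-sensitive layers are the pruning candidates. In a real run the metric would be validation perplexity rather than this toy loss.

```python
# Toy sensitivity-based layer ranking. Each "layer" is just a function here;
# in practice each would be a transformer block and loss() would be perplexity.

def run(layers, x, skip=None):
    """Run the stack, optionally skipping one layer index."""
    for i, layer in enumerate(layers):
        if i == skip:
            continue
        x = layer(x)
    return x

def sensitivity_ranking(layers, x, target):
    """Rank layers by how much skipping each one increases the loss."""
    def loss(y):
        return abs(y - target)
    base = loss(run(layers, x))
    drops = [(loss(run(layers, x, skip=i)) - base, i) for i in range(len(layers))]
    # Least important layers (smallest loss increase) come first.
    return sorted(drops)

# Toy stack: layer 1 is nearly an identity, so it should rank least important.
layers = [lambda v: v * 2, lambda v: v + 0.001, lambda v: v + 3]
ranking = sensitivity_ranking(layers, x=1.0, target=5.0)
least_important = ranking[0][1]
```

The same loop scales to a real model by ablating one block at a time and measuring held-out perplexity, which is the expensive part of the method.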
🧠 Results (TinyLlama 1.1B)
Depth-First Pruning (3 seeds)

| Config | Layers | Reduction | Test PPL | Ratio |
|---|---|---|---|---|
| Baseline (22L) | 22 | 0% | 9.19 | 1.000 |
| 20L (remove L4 + L11) | 20 | 8.0% | 9.72 ± 0.01 | 1.057 |
| 19L (staged pruning) | 19 | 12.0% | 9.94 ± 0.01 | 1.081 |

⚡ What’s interesting
- Extremely stable → ±0.01 PPL across seeds
- Transfers across GPT-2 and Llama-family models
- Keeps quality within ~6–8% while reducing size
- Produces real inference speedups, not just parameter savings
🧠 Key insight
Not all transformer layers matter equally.
Removing the least important layers:
- preserves useful structure
- avoids degrading all layers
- beats uniform width pruning
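The removal step itself is cheap. A minimal sketch (hypothetical helper, not the author's code): drop specific block indices from the model's layer list, e.g. layers 4 and 11 in the 20L config above.

```python
# Depth pruning as list filtering. With Hugging Face transformers, the layer
# list would be model.transformer.h for GPT-2 or model.model.layers for
# Llama-family models; strings stand in for blocks here to keep this runnable.

def prune_layers(layers, drop_indices):
    """Return a new stack with the given layer indices removed."""
    drop = set(drop_indices)
    return [layer for i, layer in enumerate(layers) if i not in drop]

# 22-layer stack, removing blocks 4 and 11 as in the 20L config.
stack = [f"layer_{i}" for i in range(22)]
pruned = prune_layers(stack, drop_indices=[4, 11])
```

With a real checkpoint you would also update the layer count in the model config (e.g. `num_hidden_layers` for Llama-style configs) so that saving and reloading works, then optionally fine-tune briefly to recover quality.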
🔥 Takeaway
👉 Structure > uniform scaling
Instead of:
“make every layer smaller”
Do:
👉 “remove the layers that matter least”
⚠️ Notes
- Not a new architecture
- Not claiming SOTA
- Just a clean, reproducible efficiency method
🧠 Bigger picture
This is part of a broader direction I’m exploring:
- Seed → architecture discovery (finds efficient models)
- Magnus → memory-first reasoning system
Goal:
👉 smaller, structured systems instead of bigger models