[R] Depth-first pruning transfers: GPT-2 → TinyLlama with stable gains and minimal loss
TL;DR:
Removing the right layers (instead of shrinking every layer's width) makes transformer models ~8–12% smaller with only ~6–8% quality loss (perplexity increase), and this now works across architectures (GPT-2 and TinyLlama) with near-zero variance across seeds.
I’ve been experimenting with depth-first pruning — removing entire layers based on sensitivity rather than shrinking model width.
I started on GPT-2 and have now validated it on TinyLlama 1.1B with full 3-seed replication.
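To make "removing layers based on sensitivity" concrete, here is a minimal toy sketch (hypothetical names, not the author's code): each layer's sensitivity is the quality drop when that single layer is skipped, and the least-sensitive layers are the pruning candidates. In a real run the metric would be validation perplexity rather than this toy loss.

```python
# Toy sensitivity-based layer ranking. Each "layer" is just a function here;
# in practice each would be a transformer block and loss() would be perplexity.

def run(layers, x, skip=None):
    """Run the stack, optionally skipping one layer index."""
    for i, layer in enumerate(layers):
        if i == skip:
            continue
        x = layer(x)
    return x

def sensitivity_ranking(layers, x, target):
    """Rank layers by how much skipping each one increases the loss."""
    def loss(y):
        return abs(y - target)
    base = loss(run(layers, x))
    drops = [(loss(run(layers, x, skip=i)) - base, i) for i in range(len(layers))]
    # Least important layers (smallest loss increase) come first.
    return sorted(drops)

# Toy stack: layer 1 is nearly an identity, so it should rank least important.
layers = [lambda v: v * 2, lambda v: v + 0.001, lambda v: v + 3]
ranking = sensitivity_ranking(layers, x=1.0, target=5.0)
least_important = ranking[0][1]
```

The same loop scales to a real model by ablating one block at a time and measuring held-out perplexity, which is the expensive part of the method.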
🧠 Results (TinyLlama 1.1B)
Depth-First Pruning (3 seeds)

| Config | Layers | Reduction | Test PPL | Ratio |
|---|---|---|---|---|
| Baseline (22L) | 22 | 0% | 9.19 | 1.000 |
| 20L (remove L4 + L11) | 20 | 8.0% | 9.72 ± 0.01 | 1.057 |
| 19L (staged pruning) | 19 | 12.0% | 9.94 ± 0.01 | 1.081 |

⚡ What’s interesting
- Extremely stable → ±0.01 PPL across seeds
- Transfers across GPT-2 and Llama-family models
- Keeps quality within ~6–8% while reducing size
- Produces real inference speedups, not just parameter savings
🧠 Key insight
Not all transformer layers matter equally.
Removing the least important layers:
- preserves useful structure
- avoids degrading all layers
- beats uniform width pruning
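The removal step itself is cheap. A minimal sketch (hypothetical helper, not the author's code): drop specific block indices from the model's layer list, e.g. layers 4 and 11 in the 20L config above.

```python
# Depth pruning as list filtering. With Hugging Face transformers, the layer
# list would be model.transformer.h for GPT-2 or model.model.layers for
# Llama-family models; strings stand in for blocks here to keep this runnable.

def prune_layers(layers, drop_indices):
    """Return a new stack with the given layer indices removed."""
    drop = set(drop_indices)
    return [layer for i, layer in enumerate(layers) if i not in drop]

# 22-layer stack, removing blocks 4 and 11 as in the 20L config.
stack = [f"layer_{i}" for i in range(22)]
pruned = prune_layers(stack, drop_indices=[4, 11])
```

With a real checkpoint you would also update the layer count in the model config (e.g. `num_hidden_layers` for Llama-style configs) so that saving and reloading works, then optionally fine-tune briefly to recover quality.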
🔥 Takeaway
👉 Structure > uniform scaling
Instead of:
“make every layer smaller”
Do:
👉 “remove the layers that matter least”
⚠️ Notes
- Not a new architecture
- Not claiming SOTA
- Just a clean, reproducible efficiency method
🧠 Bigger picture
This is part of a broader direction I’m exploring:
- Seed → architecture discovery (finds efficient models)
- Magnus → memory-first reasoning system
Goal:
👉 smaller, structured systems instead of bigger models