From r/MachineLearning • 1 min read
[D] - 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings
Wrote up the process of pushing Qwen 3.5 27B (dense, FP8) to 1.1M total tok/s on 96 B200 GPUs with vLLM v0.18.0.
- DP=8 nearly quadrupled throughput vs. TP=8. A 27B dense model is too small for tensor parallelism to pay off on B200s.
- MTP-1 (multi-token prediction with one draft token) mattered more than anything else (GPU utilization sat at 0% without it). MTP-5 crashed with cudaErrorIllegalAddress.
- 97.1% scaling efficiency at 8 nodes, 96.5% at 12. TPOT stayed flat at ~46 ms regardless of node count.
- Inference Gateway (KV-cache-aware routing) added ~35% overhead vs. plain ClusterIP round-robin; the single EPP (endpoint picker) pod is the bottleneck.
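For reference, a minimal single-node launch sketch of the winning configuration (DP=8, TP=1, FP8, MTP-1). The model ID, `max-model-len`, and the exact `speculative-config` method name are assumptions, not taken from the post; check them against your vLLM version:

```shell
# Sketch only: DP=8 data parallelism, no tensor parallelism, FP8 checkpoint,
# and speculative decoding with a single MTP draft token.
# Model ID and speculative method name are assumptions.
vllm serve Qwen/Qwen3.5-27B-FP8 \
  --data-parallel-size 8 \
  --tensor-parallel-size 1 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --max-model-len 2048
```

With input-len=1024 and output-len=512, a 2048-token context window is sufficient for the benchmark workload.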
Benchmarked with the InferenceMAX methodology: input-len=1024, output-len=512, 0% prefix-cache hit rate, so these are worst-case numbers.
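The headline numbers are easy to sanity-check with back-of-envelope arithmetic (all inputs are figures from the post; the implied single-node throughput is derived, not measured):

```python
# Back-of-envelope checks on the reported numbers.
total_tok_s = 1.1e6          # total throughput across the cluster
gpus = 96                    # B200s (12 nodes x 8 GPUs)
per_gpu = total_tok_s / gpus
print(f"per-GPU throughput: {per_gpu:,.0f} tok/s")          # ~11,458 tok/s

# Scaling efficiency = measured throughput / (nodes * single-node rate).
# At 12 nodes the post reports 96.5%, which implies a single-node rate of:
one_node = total_tok_s / (12 * 0.965)
print(f"implied 1-node throughput: {one_node:,.0f} tok/s")  # ~94,991 tok/s

# TPOT ~46 ms per output token -> decode time for a 512-token response:
tpot_ms = 46
print(f"decode latency per request: {tpot_ms * 512 / 1000:.1f} s")  # ~23.6 s
```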
Disclosure: I work for Google Cloud.
Tagged with
#Qwen 3.5
#27B
#B200 GPUs
#1.1M tokens/second
#FP8
#vLLM v0.18.0
#MTP-1
#tensor parallelism
#scaling efficiency
#DP=8
#TP=8
#Inference Gateway