[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on Blackwell

Google DeepMind dropped Gemma 4 today:

Gemma 4 31B: dense, 256K context, redesigned architecture targeting efficiency and long-context quality

Gemma 4 26B A4B: MoE, 26B total / 4B active per forward pass, 256K context

Both are natively multimodal (text, image, video, dynamic resolution).

We got both running on MAX on launch day across NVIDIA B200 and AMD MI355X from the same stack. On B200 we're seeing 15% higher output throughput vs. vLLM (happy to share more on methodology if useful).
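
For anyone comparing numbers, here is a minimal sketch of how a relative throughput gain like this is typically computed, assuming "output throughput" means generated tokens per second of wall-clock time. The function names and all numbers below are hypothetical placeholders for illustration, not measured results and not the methodology behind the figure above.

```python
def output_throughput(generated_tokens: int, wall_seconds: float) -> float:
    """Generated (output) tokens per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return generated_tokens / wall_seconds


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement of candidate over baseline (0.15 means 15%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return candidate / baseline - 1.0


# Placeholder numbers for illustration only, not measured results.
baseline_tps = output_throughput(generated_tokens=100_000, wall_seconds=50.0)
candidate_tps = output_throughput(generated_tokens=115_000, wall_seconds=50.0)
gain = relative_gain(candidate_tps, baseline_tps)
print(f"baseline={baseline_tps:.0f} tok/s, candidate={candidate_tps:.0f} tok/s, gain={gain:.1%}")
```

A comparison like this only means something when both runs use the same model, hardware, prompt set, output lengths, and concurrency.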

Free playground if you want to test without spinning anything up: https://www.modular.com/#playground

submitted by /u/carolinedfrasca