| Models | gemma-3-4b-it | gemma-3-12b-it | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B | deepseek-moe-16b-base | gemma-3-12b-it (30 requests) |
|---|---|---|---|---|---|---|---|---|
| Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| Model Size (GB) | 8.1 | 23 | 15 | 28 | 15 | 28 | 31 | 23 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Number of Requests | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 30 |
| Benchmark Duration (s) | 8.46 | 66.15 | 11.73 | 42.17 | 12.45 | 28.27 | 68.60 | 45.04 |
| Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 3000 |
| Total Generated Tokens | 28634 | 26587 | 24536 | 25085 | 27698 | 17406 | 28835 | 15035 |
| Request Throughput (req/s) | 5.91 | 0.76 | 4.26 | 1.19 | 4.02 | 1.77 | 0.73 | 0.67 |
| Input Throughput (tokens/s) | 591.2 | 75.58 | 426.14 | 138.57 | 401.7 | 176.84 | 72.88 | 66.61 |
| Output Throughput (tokens/s) | 3385.72 | 401.91 | 2091.15 | 574.88 | 2225.26 | 615.61 | 420.31 | 333.83 |
| Total Throughput (tokens/s) | 3976.92 | 477.49 | 2517.29 | 713.45 | 2626.96 | 792.45 | 493.19 | 400.44 |
| Median TTFT (ms) | 234.86 | 458.93 | 342.12 | 648.38 | 332.17 | 588.39 | 412.27 | 284.54 |
| P99 TTFT (ms) | 338.76 | 50692.78 | 512.94 | 860.94 | 520.55 | 849.89 | 529.31 | 547.45 |
| Median TPOT (ms) | 13.66 | 35.76 | 18.94 | 41.83 | 20.13 | 31.85 | 68.87 | 46.35 |
| P99 TPOT (ms) | 45.24 | 135.74 | 93.93 | 75.33 | 24.00 | 55.45 | 112.51 | 90.72 |
| Median Eval Rate (tokens/s) | 73.20 | 27.96 | 52.80 | 23.91 | 49.68 | 31.40 | 14.52 | 21.57 |
| P99 Eval Rate (tokens/s) | 22.10 | 7.37 | 10.65 | 13.27 | 14.67 | 18.03 | 8.89 | 11.02 |
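The TTFT, TPOT, and eval-rate figures above come from streaming requests against the vLLM OpenAI-compatible server. As a rough illustration of how these per-request metrics are measured, the sketch below times the first streamed chunk (TTFT) and the spacing of subsequent chunks (TPOT) for a single request; the endpoint URL, model name, prompt, and the "one chunk ≈ one token" approximation are assumptions here, not the exact harness used to produce the table.

```python
# Minimal sketch: measure TTFT and TPOT against a vLLM OpenAI-compatible server.
# Assumes a server such as `vllm serve google/gemma-3-12b-it` is listening on
# localhost:8000; URL, model name, and prompt are placeholders.
import time

import requests

URL = "http://localhost:8000/v1/completions"   # assumed endpoint
MODEL = "google/gemma-3-12b-it"                # assumed model name

payload = {
    "model": MODEL,
    "prompt": "Explain what TTFT and TPOT measure in LLM serving.",
    "max_tokens": 512,
    "stream": True,
}

start = time.perf_counter()
first_chunk_time = None
chunks = 0

with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        if first_chunk_time is None:
            first_chunk_time = time.perf_counter()  # first streamed chunk -> TTFT
        chunks += 1

end = time.perf_counter()
ttft_ms = (first_chunk_time - start) * 1000
# Treat each streamed chunk as roughly one output token (an approximation).
tpot_ms = (end - first_chunk_time) / max(chunks - 1, 1) * 1000
print(f"TTFT: {ttft_ms:.1f} ms  TPOT: {tpot_ms:.1f} ms  chunks: {chunks}")
```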
For those running Gemma3-12B, the DeepSeek-R1 distills, or Qwen2.5 models, the A100 40GB provides a cost-effective hosting option without sacrificing performance.
If you're looking for an affordable yet powerful GPU to run Gemma3-12B, the A100 40GB and 2×RTX 4090 are the best budget options. With around 40GB of VRAM, they handle Gemma3 4B and 12B inference efficiently while keeping costs low.
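For the dual-RTX 4090 route, vLLM can shard the model across both cards with tensor parallelism. The offline-inference sketch below shows the idea; the model id, dtype, and sampling settings are illustrative assumptions rather than the exact benchmark configuration.

```python
# Minimal sketch: run Gemma3-12B across two GPUs with vLLM tensor parallelism.
# tensor_parallel_size=2 splits the weights across both RTX 4090s; model name
# and sampling parameters are placeholders, not the benchmark settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-12b-it",   # assumed HF model id
    tensor_parallel_size=2,          # one shard per GPU
    dtype="bfloat16",                # 16-bit weights, matching the table above
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of tensor parallel inference."], params)
print(outputs[0].outputs[0].text)
```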
To optimize inference performance and minimize latency for Gemma3-12B on the A100 40GB, we recommend limiting concurrent requests to 30. As the 30-request column in the table shows, this keeps P99 TTFT in the hundreds of milliseconds rather than tens of seconds, preventing excessive queuing and improving real-time responsiveness in LLM deployments.
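One straightforward way to enforce that 30-request cap is on the client side with an asyncio semaphore, as sketched below; the endpoint URL, model name, and prompts are placeholders. Server-side, vLLM's --max-num-seqs engine argument limits how many sequences are processed concurrently and can serve a similar purpose.

```python
# Minimal sketch: cap in-flight requests to a vLLM server at 30 on the client
# side. URL, model name, and prompts are placeholders.
import asyncio

import httpx

URL = "http://localhost:8000/v1/completions"   # assumed vLLM endpoint
MODEL = "google/gemma-3-12b-it"                # assumed model name
MAX_CONCURRENCY = 30

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def complete(client: httpx.AsyncClient, prompt: str) -> str:
    async with semaphore:  # at most 30 requests in flight at once
        resp = await client.post(
            URL,
            json={"model": MODEL, "prompt": prompt, "max_tokens": 100},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]

async def main() -> None:
    prompts = [f"Question {i}: what does TTFT measure?" for i in range(50)]
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(complete(client, p) for p in prompts))
    print(f"Completed {len(results)} requests")

asyncio.run(main())
```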
The A100 40GB delivers stable inference performance, making it a cost-effective choice for Gemma3-12B hosting and Qwen-14B hosting.
Recommended GPU dedicated server plans for these workloads:

- Enterprise GPU Dedicated Server - A100
- Enterprise GPU Dedicated Server - RTX A6000
- Multi-GPU Dedicated Server - 2xRTX 4090
- Enterprise GPU Dedicated Server - A100 (80GB)
Tags: A100 vLLM, A100 inference, cheap GPU LLM inference, Gemma 3 hosting, Gemma 3-4B hosting, Gemma 3-12B hosting, vLLM benchmark, A100 40GB vs H100, best GPU for LLM, affordable LLM inference, multi-GPU inference, vLLM distributed architecture