Benchmark results with 50 requests per model:

| Models | gemma-3-4b-it | gemma-3-12b-it | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B |
|---|---|---|---|---|---|---|
| Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 8.1 | 23 | 15 | 28 | 15 | 28 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Tensor Parallelism | 2 | 1 | 2 | 1 | 2 | 1 |
| Pipeline Parallelism | 1 | 3 | 1 | 3 | 1 | 3 |
| Data Type | float16 | float16 | float16 | float16 | float16 | float16 |
| Number of Requests | 50 | 50 | 50 | 50 | 50 | 50 |
| Benchmark Duration (s) | 18.94 | 77.88 | 21.35 | 51.51 | 22.76 | 46.56 |
| Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
| Total Generated Tokens | 30000 | 30000 | 24552 | 25370 | 27980 | 17336 |
| Request Throughput (req/s) | 2.64 | 0.64 | 2.34 | 0.97 | 2.20 | 1.07 |
| Input Throughput (tokens/s) | 263.96 | 64.20 | 234.16 | 97.06 | 219.69 | 107.38 |
| Output Throughput (tokens/s) | 1583.76 | 385.19 | 1149.84 | 492.50 | 1229.37 | 372.32 |
| Total Throughput (tokens/s) | 1847.72 | 449.39 | 1384.00 | 589.56 | 1449.06 | 479.70 |
| Median TTFT (ms) | 688.76 | 1156.65 | 800.71 | 1278.92 | 903.35 | 1238.78 |
| P99 TTFT (ms) | 713.25 | 1347.99 | 866.98 | 1500.20 | 1014.31 | 1461.00 |
| Median TPOT (ms) | 30.41 | 96.99 | 34.26 | 83.73 | 36.44 | 76.06 |
| P99 TPOT (ms) | 31.02 | 128.02 | 94.25 | 85.29 | 38.05 | 90.38 |
| Median Eval Rate (tokens/s) | 32.88 | 27.96 | 29.19 | 11.94 | 27.44 | 13.15 |
| P99 Eval Rate (tokens/s) | 32.24 | 7.37 | 10.61 | 11.72 | 26.28 | 11.06 |
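The parallelism settings in the table map directly onto vLLM's engine arguments. The snippet below is a minimal sketch of loading one of the smaller models with tensor parallelism across two of the V100s (TP=2, PP=1) using the vLLM Python API; the prompt, sampling parameters, and `gpu_memory_utilization` value are illustrative assumptions, not settings taken from the original benchmark harness.

```python
from vllm import LLM, SamplingParams

# TP=2, PP=1: each layer's weights are split across 2 of the 3 V100s,
# matching the configuration used for gemma-3-4b-it, Qwen2.5-7B-Instruct,
# and DeepSeek-R1-Distill-Llama-8B in the table above.
llm = LLM(
    model="google/gemma-3-4b-it",   # illustrative model ID
    tensor_parallel_size=2,         # shard weights/activations over 2 GPUs
    pipeline_parallel_size=1,       # no layer-wise pipelining
    dtype="float16",                # matches the Data Type row
    gpu_memory_utilization=0.90,    # assumed value; leaves headroom for the KV cache
)

# The table implies roughly 100 input tokens per request (5000 tokens / 50 requests);
# the prompt and max_tokens below are placeholders, not the benchmark workload.
params = SamplingParams(max_tokens=600, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```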
Benchmark results with 100 requests per model:

| Models | gemma-3-4b-it | gemma-3-12b-it | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B |
|---|---|---|---|---|---|---|
| Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 8.1 | 23 | 15 | 28 | 15 | 28 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Tensor Parallelism | 2 | 1 | 2 | 1 | 2 | 1 |
| Pipeline Parallelism | 1 | 3 | 1 | 3 | 1 | 3 |
| Data Type | float16 | float16 | float16 | float16 | float16 | float16 |
| Number of Requests | 100 | 100 | 100 | 100 | 100 | 100 |
| Benchmark Duration (s) | 28.28 | 160.75 | 33.13 | 75.29 | 31.44 | 58.50 |
| Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| Total Generated Tokens | 60000 | 60000 | 49053 | 50097 | 54680 | 37976 |
| Request Throughput (req/s) | 3.54 | 0.62 | 3.02 | 1.33 | 3.18 | 1.71 |
| Input Throughput (tokens/s) | 353.57 | 62.21 | 301.84 | 132.82 | 318.04 | 170.93 |
| Output Throughput (tokens/s) | 2121.41 | 373.25 | 1480.62 | 665.36 | 1739.07 | 649.12 |
| Total Throughput (tokens/s) | 2474.98 | 435.46 | 1782.46 | 798.18 | 2057.11 | 820.05 |
| Median TTFT (ms) | 1166.77 | 1897.38 | 1402.62 | 2224.29 | 1426.77 | 2174.83 |
| P99 TTFT (ms) | 1390.66 | 2276.37 | 1692.47 | 2708.26 | 1958.23 | 2649.59 |
| Median TPOT (ms) | 45.16 | 148.57 | 52.86 | 107.22 | 49.66 | 94.56 |
| P99 TPOT (ms) | 45.58 | 261.89 | 377.16 | 120.11 | 59.43 | 156.27 |
| Median Eval Rate (tokens/s) | 22.14 | 6.73 | 18.92 | 9.33 | 20.13 | 10.57 |
| P99 Eval Rate (tokens/s) | 22.04 | 3.82 | 2.65 | 8.33 | 16.83 | 6.40 |
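The latency metrics in both tables (TTFT, TPOT, eval rate) can be reproduced with any streaming client pointed at vLLM's OpenAI-compatible endpoint. Below is a minimal sketch, assuming a server already running on localhost:8000 and the `openai` Python client; the model name, prompt, and the one-chunk-per-token approximation are assumptions, not details of the original benchmark harness.

```python
import time
from openai import OpenAI  # assumes `pip install openai` and a vLLM server at localhost:8000

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_request(prompt: str, max_tokens: int = 600):
    """Stream one completion and record TTFT plus time-per-output-token (TPOT)."""
    start = time.perf_counter()
    first_token_time = None
    n_chunks = 0

    stream = client.completions.create(
        model="google/gemma-3-4b-it",  # illustrative; must match the served model name
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].text:
            if first_token_time is None:
                first_token_time = time.perf_counter()  # first streamed token arrives
            n_chunks += 1  # approximation: one streamed chunk ~ one token

    end = time.perf_counter()
    ttft_ms = (first_token_time - start) * 1000
    tpot_ms = (end - first_token_time) * 1000 / max(n_chunks - 1, 1)
    return ttft_ms, tpot_ms

ttft, tpot = measure_request("Summarize the benefits of pipeline parallelism.")
print(f"TTFT: {ttft:.2f} ms, TPOT: {tpot:.2f} ms")
```

Running many such requests concurrently (50 or 100, as in the tables) and taking the median and P99 of the per-request numbers yields figures comparable to those reported above.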
Interested in optimizing your vLLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference:

- Multi-GPU Dedicated Server - 3xV100
- Multi-GPU Dedicated Server - 2xRTX 4090
- Enterprise GPU Dedicated Server - A100
- Multi-GPU Dedicated Server - 2xA100
This 3×V100 vLLM benchmark demonstrates that carefully tuning the tensor-parallel and pipeline-parallel settings can maximize model performance while extending the hardware's reach to larger models. By following these practices, researchers and developers can optimize LLM inference on budget-friendly multi-GPU setups.
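The split visible in both tables follows a simple pattern: the smaller checkpoints (8.1-15 GB) run with tensor parallelism over two cards (TP=2, PP=1), while the larger 12B/14B checkpoints (23-28 GB) are pipelined across all three cards (TP=1, PP=3). A plausible reason TP=3 is avoided is that vLLM requires the attention head count to be divisible by the tensor-parallel degree, which a degree of 3 rarely satisfies. The helper below sketches that rule of thumb; the per-GPU memory figure and the headroom factor are assumptions for illustration, not values from the benchmark.

```python
def suggest_parallelism(model_size_gb: float,
                        gpu_mem_gb: float = 16.0,  # assumption: 16 GB V100s
                        num_gpus: int = 3,
                        headroom: float = 1.5):    # assumption: extra room for KV cache/activations
    """Rule-of-thumb config picker mirroring the tables above: shard across
    2 GPUs with tensor parallelism if the model (plus headroom) fits there,
    otherwise pipeline it across all available GPUs."""
    if model_size_gb * headroom <= 2 * gpu_mem_gb:
        return {"tensor_parallel_size": 2, "pipeline_parallel_size": 1}
    return {"tensor_parallel_size": 1, "pipeline_parallel_size": num_gpus}

# 8.1 GB (gemma-3-4b-it)       -> TP=2, PP=1, as in the tables
# 28 GB (Qwen2.5-14B-Instruct) -> TP=1, PP=3, as in the tables
print(suggest_parallelism(8.1))
print(suggest_parallelism(28))
```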
Tags: 3×V100 vLLM benchmark, vLLM multi-GPU inference, tensor-parallel, pipeline-parallel, vLLM performance optimization, Gemma 3-12B inference, Qwen 7B benchmark, Llama-8B performance, LLM inference tuning, float16 precision, vLLM concurrency tuning