Models | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | Qwen/Qwen2.5-14B-Instruct | Qwen/Qwen2.5-7B-Instruct | google/gemma-3-12b-it |
---|---|---|---|---|---|---|
Quantization | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit |
Size (GB) | 15 | 15 | 28 | 28 | 15 | 23 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 50 | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 21.81 | 23.81 | 38.77 | 47.58 | 23.48 | 65.02 |
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
Total Generated Tokens | 22170 | 27517 | 17321 | 25213 | 25703 | 27148 |
Request Throughput (req/s) | 2.29 | 2.10 | 1.29 | 1.05 | 2.13 | 0.77 |
Input Throughput (tokens/s) | 229.27 | 210.02 | 128.96 | 105.08 | 212.93 | 76.90 |
Output Throughput (tokens/s) | 1016.60 | 1155.80 | 446.73 | 529.88 | 1094.54 | 417.54 |
Total Throughput (tokens/s) | 1245.87 | 1365.82 | 575.69 | 634.96 | 1307.47 | 494.44 |
Median TTFT (ms) | 219.88 | 756.72 | 1153.98 | 1434.29 | 761.11 | 401.22 |
P99 TTFT (ms) | 242.82 | 940.90 | 1486.31 | 1847.22 | 983.28 | 573.97 |
Median TPOT (ms) | 35.96 | 38.41 | 63.45 | 76.88 | 37.82 | 107.74 |
P99 TPOT (ms) | 36.12 | 41.08 | 95.96 | 80.39 | 190.41 | 115.72 |
Median Eval Rate (tokens/s) | 27.81 | 26.03 | 15.76 | 13.01 | 26.44 | 9.28 |
P99 Eval Rate (tokens/s) | 27.69 | 24.34 | 10.42 | 12.44 | 5.25 | 8.64 |
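The eval-rate rows are simply the reciprocal of the TPOT (time per output token) rows: a median TPOT of 35.96 ms works out to 1000 / 35.96 ≈ 27.81 tokens/s for a single request. Below is a minimal Python sketch of that conversion, using the median TPOT values from the 50-request table above; it reproduces the Median Eval Rate row.

```python
# Convert TPOT (time per output token, in ms) into a per-request
# decode rate in tokens/s: rate = 1000 / tpot_ms.
# Values are the Median TPOT numbers from the 50-request table.
median_tpot_ms = {
    "meta-llama/Llama-3.1-8B-Instruct": 35.96,
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B": 38.41,
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B": 63.45,
    "Qwen/Qwen2.5-14B-Instruct": 76.88,
    "Qwen/Qwen2.5-7B-Instruct": 37.82,
    "google/gemma-3-12b-it": 107.74,
}

for model, tpot in median_tpot_ms.items():
    eval_rate = 1000.0 / tpot  # tokens per second seen by one request
    print(f"{model}: {eval_rate:.2f} tokens/s")
```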
Models | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | Qwen/Qwen2.5-14B-Instruct | Qwen/Qwen2.5-7B-Instruct | google/gemma-3-12b-it |
---|---|---|---|---|---|---|
Quantization | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit | 16-bit |
Size (GB) | 15 | 15 | 28 | 28 | 15 | 23 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 100 | 100 | 100 | 100 | 100 | 100 |
Benchmark Duration (s) | 26.61 | 40.83 | 45.25 | 87.61 | 42.94 | 123.86 |
Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
Total Generated Tokens | 45375 | 54430 | 37135 | 51177 | 48267 | 50548 |
Request Throughput (req/s) | 3.76 | 2.45 | 2.21 | 1.14 | 2.33 | 0.81 |
Input Throughput (tokens/s) | 375.86 | 244.94 | 220.99 | 114.14 | 241.87 | 80.74 |
Output Throughput (tokens/s) | 1705.49 | 1333.21 | 820.62 | 584.14 | 1123.99 | 408.12 |
Total Throughput (tokens/s) | 2081.35 | 1578.15 | 1041.61 | 698.28 | 1356.86 | 488.86 |
Median TTFT (ms) | 487.81 | 671.18 | 546.93 | 1426.07 | 1093.55 | 1865.75 |
P99 TTFT (ms) | 1065.56 | 1422.67 | 1838.40 | 5381.54 | 2443.18 | 4022.71 |
Median TPOT (ms) | 43.43 | 66.86 | 74.32 | 143.52 | 69.68 | 154.06 |
P99 TPOT (ms) | 64.93 | 67.43 | 118.59 | 166.48 | 793.62 | 1091.31 |
Median Eval Rate (tokens/s) | 23.03 | 14.86 | 13.45 | 6.97 | 14.35 | 6.49 |
P99 Eval Rate (tokens/s) | 43.42 | 14.83 | 8.43 | 6.01 | 1.26 | 0.92 |
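TTFT and TPOT can also be spot-checked per request against a running vLLM server with any OpenAI-compatible streaming client. The sketch below is illustrative only: it assumes a vLLM server is already serving one of the benchmarked models at http://localhost:8000/v1, and the prompt, token budget, and one-token-per-chunk approximation are assumptions rather than the exact settings used for the tables above.

```python
import time
from openai import OpenAI  # pip install openai

# Assumption: a vLLM OpenAI-compatible server is already running on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_chunks = 0

# Stream a single completion and time the chunks as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any model served by this vLLM instance
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now  # time to first token (TTFT) measured here
        n_chunks += 1              # approximation: treat each content chunk as one token

end = time.perf_counter()
ttft_ms = (first_token_time - start) * 1000
tpot_ms = (end - first_token_time) * 1000 / max(n_chunks - 1, 1)
print(f"TTFT: {ttft_ms:.1f} ms, TPOT: {tpot_ms:.1f} ms, ~{1000 / tpot_ms:.1f} tokens/s")
```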
Interested in optimizing your vLLM deployment? Check out our cheap GPU server hosting services or explore alternative GPUs for high-end AI inference.
- Enterprise GPU Dedicated Server - RTX 4090
- Enterprise GPU Dedicated Server - A40
- Enterprise GPU Dedicated Server - A100
- Enterprise GPU Dedicated Server - A100 (80GB)
The NVIDIA A40 is excellent for 7B–8B models (DeepSeek, Llama-3, Qwen-7B), even at 100+ concurrent requests. 14B models also run, but they require optimizations such as capping the context length (--max-model-len 4096) and thermal monitoring under sustained load.
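As a rough illustration of that tuning, the same limit is exposed by vLLM's offline Python API; the values below (4096-token context, 90% GPU memory utilization) are assumed starting points for a 48 GB A40, not the exact configuration used in these benchmarks.

```python
from vllm import LLM, SamplingParams

# Capping the context length and limiting how much of the A40's 48 GB vLLM may
# claim keeps a 14B 16-bit model's weights plus KV cache within memory.
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    max_model_len=4096,            # same effect as --max-model-len 4096 on the server
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM is allowed to use
)

outputs = llm.generate(
    ["Summarize the benefits of tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```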
The A40 is a robust and cost-effective GPU for LLM inference up to 14B, comfortably supporting up to 100 concurrent requests. With minor tuning, it is well-suited for multi-user inference workloads and production-scale deployments of open-weight LLMs. For scalable hosting of LLaMA, Qwen, DeepSeek and other 7–14B models, the A40 stands out as a reliable mid-tier GPU choice.