Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Qwen2.5-14B-Instruct-1M | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B | deepseek-moe-16b-base |
---|---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 15 | 23 | 28 | 15 | 28 | 31 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 50 | 50 | 50 | 50 | 50 | 50 |
Benchmark Duration (s) | 11.1 | 35.45 | 26.62 | 12.50 | 25.63 | 75.26 |
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
Total Generated Tokens | 24162 | 25991 | 14778 | 28119 | 17070 | 30000 |
Request (req/s) | 4.5 | 1.41 | 1.88 | 4.0 | 1.95 | 0.66 |
Input (tokens/s) | 450.3 | 141.06 | 187.79 | 400.02 | 195.09 | 66.43 |
Output (tokens/s) | 2176.08 | 733.28 | 555.05 | 2249.69 | 666.05 | 398.62 |
Total Throughput (tokens/s) | 2626.38 | 874.34 | 742.84 | 2649.71 | 861.14 | 465.05 |
Median TTFT (ms) | 174.56 | 911.54 | 888.37 | 475.44 | 932.66 | 397.10 |
P99 TTFT (ms) | 199.33 | 1302.34 | 1299.36 | 740.77 | 1311.26 | 477.03 |
Median TPOT (ms) | 18.17 | 43.27 | 33.27 | 20.01 | 32.16 | 78.70 |
P99 TPOT (ms) | 19.29 | 83.82 | 374.75 | 20.50 | 66.40 | 124.14 |
Median Eval Rate (tokens/s) | 55.04 | 23.11 | 30.08 | 49.98 | 31.09 | 12.71 |
P99 Eval Rate (tokens/s) | 51.84 | 11.93 | 2.67 | 48.78 | 15.06 | 8.06 |
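For readers who want to see how the throughput rows relate to the raw counts, here is a minimal sketch assuming the usual vLLM serving-benchmark definitions (token totals divided by wall-clock benchmark duration); the numbers are taken from the Qwen2.5-7B-Instruct column of the 50-request table, and small deviations come from the rounded duration:

```python
# Minimal sketch: derive the throughput rows from the raw counts above,
# assuming throughput = token totals / wall-clock benchmark duration.
# Values: Qwen2.5-7B-Instruct at 50 requests.

num_requests = 50
duration_s = 11.1            # Benchmark Duration (s)
total_input_tokens = 5000    # Total Input Tokens
total_output_tokens = 24162  # Total Generated Tokens

request_throughput = num_requests / duration_s           # ~4.5 req/s
input_throughput = total_input_tokens / duration_s       # ~450 tokens/s
output_throughput = total_output_tokens / duration_s     # ~2177 tokens/s
total_throughput = input_throughput + output_throughput  # ~2627 tokens/s

print(f"{request_throughput:.2f} req/s | {input_throughput:.1f} in tok/s | "
      f"{output_throughput:.1f} out tok/s | {total_throughput:.1f} total tok/s")
```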
Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Qwen2.5-14B-Instruct-1M | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B | deepseek-moe-16b-base |
---|---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 15 | 23 | 28 | 15 | 28 | 31 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
Number of Requests | 100 | 100 | 100 | 100 | 100 | 100 |
Benchmark Duration (s) | 16.69 | 71.06 | 51.41 | 17.06 | 49.21 | 134.11 |
Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
Total Generated Tokens | 48681 | 50712 | 31722 | 54514 | 36233 | 58835 |
Request (req/s) | 5.99 | 1.41 | 1.95 | 5.86 | 2.03 | 0.75 |
Input (tokens/s) | 599.32 | 140.73 | 194.52 | 586.04 | 203.22 | 74.57 |
Output (tokens/s) | 2917.58 | 713.66 | 617.03 | 3194.74 | 736.32 | 438.71 |
Total Throughput (tokens/s) | 3516.90 | 854.39 | 811.55 | 3780.78 | 939.54 | 513.28 |
Median TTFT (ms) | 823.92 | 1556.58 | 1443.16 | 1008.07 | 1506.46 | 600.12 |
P99 TTFT (ms) | 1262.10 | 40030.84 | 36280.90 | 1466.37 | 34223.08 | 95406.51 |
Median TPOT (ms) | 26.64 | 47.59 | 43.93 | 26.65 | 42.60 | 98.78 |
P99 TPOT (ms) | 212.21 | 133.45 | 1964.77 | 40.56 | 176.95 | 185.71 |
Median Eval Rate (tokens/s) | 37.54 | 21.01 | 22.76 | 37.52 | 23.47 | 10.12 |
P99 Eval Rate (tokens/s) | 4.71 | 7.49 | 0.51 | 24.65 | 5.65 | 5.38 |
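The per-request eval-rate rows are consistent with simply being the reciprocal of TPOT (time per output token); a quick sanity check against the table above, as a small sketch:

```python
# Minimal sketch: "Eval Rate" is the reciprocal of TPOT,
# converted from ms per token to tokens per second.

def eval_rate(tpot_ms: float) -> float:
    """Per-request decode speed in tokens/s for a given TPOT in ms."""
    return 1000.0 / tpot_ms

# Qwen2.5-7B-Instruct at 100 requests (values taken from the table above)
print(eval_rate(26.64))   # ~37.54 tokens/s -> matches Median Eval Rate
print(eval_rate(212.21))  # ~4.71 tokens/s  -> matches P99 Eval Rate
```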
Running vLLM on 2×RTX 4090 delivers stable inference, making the dual-card setup a cost-effective alternative to enterprise GPUs.
The dual RTX 4090 configuration proves to be a cost-effective solution for deploying 14B-16B (~30 GB) LLMs in production environments, particularly when using vLLM with proper performance tuning. By limiting concurrency to 50 requests for the larger models and using tensor parallelism across both cards, developers can achieve stable performance with acceptable latency, as sketched below.
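As a rough illustration of that tuning, here is a minimal sketch using vLLM's offline Python API; the model name, memory fraction, and sampling settings are illustrative assumptions rather than the exact benchmark configuration:

```python
# Minimal sketch of a dual-GPU vLLM setup for a ~14B model.
# tensor_parallel_size=2 shards the weights across both RTX 4090s;
# max_num_seqs=50 caps in-flight requests, mirroring the 50-request
# sweet spot observed above. Model name and memory fraction are
# illustrative assumptions, not the exact benchmark configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",  # assumed; any ~14B-16B model is loaded similarly
    tensor_parallel_size=2,             # split across the two RTX 4090s
    max_num_seqs=50,                    # limit concurrent sequences
    gpu_memory_utilization=0.90,        # leave headroom for the KV cache and activations
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same limits apply to the OpenAI-compatible server via `--tensor-parallel-size 2` and `--max-num-seqs 50`.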
For developers looking for vLLM server rental, multi-GPU inference, and cost-effective LLM hosting, 2×RTX 4090 provides strong performance without the high cost of enterprise GPUs.
However, for 32B LLMs and extreme concurrency (100+ concurrent requests), an enterprise-grade GPU such as the A100 or H100 is recommended.