2*RTX 4090 vLLM Benchmark: Dual GPU Parallel Inference for Hugging Face LLM

With the growing adoption of large language models (LLMs), selecting the best GPU for 14B-16B (~30GB) model inference is a critical decision for AI developers and cloud providers. This article presents a vLLM benchmark on 2*RTX 4090, evaluating dual-GPU parallel inference of Hugging Face LLMs with vLLM performance tuning.

If you are looking for GPU recommendations for 14B-16B models, are interested in vLLM performance on 2*RTX 4090, or are considering vLLM server rental, this benchmark will provide key insights.

Test Overview

1. RTX 4090 GPU Details (No NVLink):

  • GPU: NVIDIA RTX 4090
  • Microarchitecture: Ada Lovelace
  • Compute capability: 8.9
  • CUDA Cores: 16384
  • Tensor Cores: 512
  • Memory: 24GB GDDR6X
  • FP32 performance: 82.6 TFLOPS

2. Test Project Code Source:

3. The Following Models from Hugging Face were Tested:

  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/Qwen2.5-14B-Instruct
  • Qwen/Qwen2.5-14B-Instruct-1M
  • deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
  • deepseek-ai/deepseek-moe-16b-base

4. The Test Parameters are Preset as Follows:

  • Input length: 100 tokens
  • Output length: 600 tokens
  • --tensor-parallel-size: 2

5. We conducted two rounds of 2*RTX 4090 vLLM tests under different concurrent request loads (a minimal client-side sketch of such a test follows this list):

  • Scenario 1: 50 concurrent requests
  • Scenario 2: 100 concurrent requests
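
The original benchmark harness is not reproduced here, but the sketch below shows one way such a concurrent-load test can be driven in Python against the OpenAI-compatible endpoint that vLLM exposes, assuming the server was started with --tensor-parallel-size 2. The model name, prompt construction, and percentile math are illustrative assumptions, not the tool that produced the numbers below.

```python
# Minimal concurrency sketch (not the original test harness): fire N
# simultaneous streaming requests against a vLLM OpenAI-compatible server
# and record per-request TTFT and aggregate output rate. Assumes a server
# like `vllm serve Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 2`
# is already listening on localhost:8000; names and values are placeholders.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

MODEL = "Qwen/Qwen2.5-14B-Instruct"  # any of the six tested models
CONCURRENCY = 50                     # 50 or 100, matching the two scenarios
PROMPT = "benchmark " * 100          # roughly 100 input tokens
MAX_TOKENS = 600                     # 600 output tokens, as in the preset

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request():
    start = time.perf_counter()
    ttft, chunks = None, 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=MAX_TOKENS,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            chunks += 1                             # roughly 1 token per chunk
    return ttft, chunks, time.perf_counter() - start


async def main():
    results = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    ttfts = sorted(t for t, _, _ in results if t is not None)
    duration = max(total for _, _, total in results)  # ~benchmark duration
    tokens = sum(c for _, c, _ in results)
    print(f"Median TTFT: {statistics.median(ttfts) * 1000:.1f} ms")
    print(f"P99 TTFT:    {ttfts[int(0.99 * (len(ttfts) - 1))] * 1000:.1f} ms")
    print(f"Output throughput: {tokens / duration:.1f} tokens/s")


asyncio.run(main())
```

Token counts here are approximated from streaming chunks; a real benchmark tool reports tokenizer-accurate counts.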

2*RTX 4090 Benchmark for Scenario 1: 50 Concurrent Requests

| Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Qwen2.5-14B-Instruct-1M | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B | deepseek-moe-16b-base |
| --- | --- | --- | --- | --- | --- | --- |
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 15 | 23 | 28 | 15 | 28 | 31 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 50 | 50 | 50 | 50 | 50 | 50 |
| Benchmark Duration (s) | 11.13 | 35.45 | 26.62 | 12.50 | 25.63 | 75.26 |
| Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
| Total Generated Tokens | 24162 | 25991 | 14778 | 28119 | 17070 | 30000 |
| Request (req/s) | 4.5 | 1.41 | 1.88 | 4.0 | 1.95 | 0.66 |
| Input (tokens/s) | 450.3 | 141.06 | 187.79 | 400.02 | 195.09 | 66.43 |
| Output (tokens/s) | 2176.08 | 733.28 | 555.05 | 2249.69 | 666.05 | 398.62 |
| Total Throughput (tokens/s) | 2626.38 | 874.34 | 742.84 | 2649.71 | 861.14 | 465.05 |
| Median TTFT (ms) | 174.56 | 911.54 | 888.37 | 475.44 | 932.66 | 397.10 |
| P99 TTFT (ms) | 199.33 | 1302.34 | 1299.36 | 740.77 | 1311.26 | 477.03 |
| Median TPOT (ms) | 18.17 | 43.27 | 33.27 | 20.01 | 32.16 | 78.70 |
| P99 TPOT (ms) | 19.29 | 83.82 | 374.75 | 20.50 | 66.40 | 124.14 |
| Median Eval Rate (tokens/s) | 55.04 | 23.11 | 30.08 | 49.98 | 31.09 | 12.71 |
| P99 Eval Rate (tokens/s) | 51.84 | 11.93 | 2.67 | 48.78 | 15.06 | 8.06 |

✅ Key Takeaways:

  • 2*RTX 4090 performs well for 14B-16B models under moderate concurrency. DeepSeek-R1-Distill-Qwen-14B achieved 861.14 tokens/s total throughput, and deepseek-moe-16b-base ran at 465.05 tokens/s.
  • P99 TTFT remains below 2 seconds for most 14B models, making it suitable for real-time inference.
  • Inference throughput is stable, proving 2*RTX 4090 can handle vLLM production workloads.

2*RTX 4090 Benchmark for Scenario 2: 100 Concurrent Requests

| Models | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | Qwen2.5-14B-Instruct-1M | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B | deepseek-moe-16b-base |
| --- | --- | --- | --- | --- | --- | --- |
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 15 | 23 | 28 | 15 | 28 | 31 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 100 | 100 | 100 | 100 | 100 | 100 |
| Benchmark Duration (s) | 16.69 | 71.06 | 51.41 | 17.06 | 49.21 | 134.11 |
| Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| Total Generated Tokens | 48681 | 50712 | 31722 | 54514 | 36233 | 58835 |
| Request (req/s) | 5.99 | 1.41 | 1.95 | 5.86 | 2.03 | 0.75 |
| Input (tokens/s) | 599.32 | 140.73 | 194.52 | 586.04 | 203.22 | 74.57 |
| Output (tokens/s) | 2917.58 | 713.66 | 617.03 | 3194.74 | 736.32 | 438.71 |
| Total Throughput (tokens/s) | 3516.90 | 854.39 | 811.55 | 3780.78 | 939.54 | 513.28 |
| Median TTFT (ms) | 823.92 | 1556.58 | 1443.16 | 1008.07 | 1506.46 | 600.12 |
| P99 TTFT (ms) | 1262.10 | 40030.84 | 36280.90 | 1466.37 | 34223.08 | 95406.51 |
| Median TPOT (ms) | 26.64 | 47.59 | 43.93 | 26.65 | 42.60 | 98.78 |
| P99 TPOT (ms) | 212.21 | 133.45 | 1964.77 | 40.56 | 176.95 | 185.71 |
| Median Eval Rate (tokens/s) | 37.54 | 21.01 | 22.76 | 37.52 | 23.47 | 10.12 |
| P99 Eval Rate (tokens/s) | 4.71 | 7.49 | 0.51 | 24.65 | 5.65 | 5.38 |

✅ Key Takeaways:

  • 14B-16B models face significant performance degradation at 100 concurrent requests.
  • deepseek-moe-16b-base's P99 TTFT exceeded 95 seconds, making it impractical for high-load applications. DeepSeek-R1-Distill-Qwen-14B struggled with throughput under 100 requests. The P99 TPOT of Qwen2.5-14B-Instruct-1M increased to 1964.77ms.
  • For models below 8B, performance remains strong. DeepSeek-R1-Distill-Llama-8B handled 100 requests at 3780.78 tokens/s total throughput.

vLLM Performance Tuning Recommendations

✅ For 14B-16B models:

  • Set --tensor-parallel-size 2 so the model weights are sharded across both GPUs.
  • Limit concurrent requests to around 50 for acceptable P99 TTFT.
  • Use continuous batching to maximize throughput (see the configuration sketch after this list).
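
As a concrete illustration of these settings, here is a minimal sketch using vLLM's offline LLM API; the model choice, the gpu_memory_utilization value, and the 50-sequence cap are assumptions taken from the recommendations above, not measured optima. When serving with the OpenAI-compatible server instead, the corresponding flags are --tensor-parallel-size, --max-num-seqs, and --gpu-memory-utilization.

```python
# Hedged sketch of the tuning advice above, using vLLM's offline LLM API.
# tensor_parallel_size=2 shards each layer's weights across both RTX 4090s;
# max_num_seqs caps how many sequences the continuous-batching scheduler
# keeps in flight (a stand-in here for "limit concurrency to ~50").
# The model choice and numeric values are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    tensor_parallel_size=2,        # shard weights across the two 24GB GPUs
    max_num_seqs=50,               # scheduler-level concurrency cap
    gpu_memory_utilization=0.90,   # leave headroom for activations + KV cache
)

outputs = llm.generate(
    ["Explain tensor parallelism in one paragraph."] * 50,
    SamplingParams(max_tokens=600),
)
print(len(outputs), "completions generated")
```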

✅ For 7B-8B Models:

  • Dual-GPU parallel processing is much faster than a single GPU, but given the cost, a single RTX 4090 may be sufficient. View the benchmark results of a single RTX 4090...
  • These models can handle 100+ concurrent requests efficiently without major slowdowns.

Production Deployment Best Practices

✅ For Latency-Sensitive Applications:

  • Prefer 7B-8B models for lower inference latency.
  • When using 14B-16B models, limit concurrency to 50 to maintain fast responsiveness.

✅ When Renting a vLLM Server:

  • Ensure the provider supports multi-GPU configurations properly.
  • Request benchmarks for your specific model size and workload before committing.

✅ Scaling for Future Demands:

  • For larger models (16B+) or 100+ concurrent requests, consider upgrading to an A100 or H100.
  • Monitor P99 TTFT closely as a key SLA metric for production stability (a monitoring sketch follows this list).
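
As one way to act on the P99 TTFT recommendation, the sketch below polls the Prometheus metrics endpoint that vLLM's OpenAI-compatible server exposes and prints the TTFT histogram samples. The metric name is an assumption based on recent vLLM releases and may differ in your version; in a real deployment you would more likely chart histogram_quantile(0.99, ...) over these buckets in Prometheus or Grafana.

```python
# Hedged monitoring sketch: poll the Prometheus metrics that vLLM's
# OpenAI-compatible server exposes at /metrics and print the TTFT
# histogram samples. The metric name below is an assumption based on
# recent vLLM releases and may differ in your version.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"      # vLLM server metrics
TTFT_METRIC = "vllm:time_to_first_token_seconds"   # assumed histogram name

text = requests.get(METRICS_URL, timeout=5).text
for family in text_string_to_metric_families(text):
    if family.name.startswith(TTFT_METRIC):
        for sample in family.samples:              # _bucket/_sum/_count series
            print(sample.name, sample.labels.get("le", ""), sample.value)
```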

Get Started with 2*RTX 4090 Server Hosting

A 2*RTX 4090 server running vLLM provides stable inference for 14B-16B models, making it a cost-effective alternative to enterprise GPUs.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.

Multi-GPU Dedicated Server - 2xRTX 4090

$449.50/mo
50% OFF Recurring (Was $899.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% OFF Recurring (Was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Enterprise GPU Dedicated Server - A100(80GB)

$1559.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS

Conclusion: When to Use 2*RTX 4090 for LLM Inference?

The dual RTX 4090 configuration proves to be a cost-effective solution for deploying 14B-16B(~30GB) LLMs in production environments, particularly when using vLLM with proper performance tuning. By limiting concurrent requests to 50 for larger models and utilizing tensor parallelism, developers can achieve stable performance with acceptable latency metrics.

For developers looking for vLLM server rental, multi-GPU inference, and cost-effective LLM hosting, 2*RTX 4090 provides strong performance without the high cost of enterprise GPUs.

However, for 32B LLMs and extreme concurrency (100+ requests), an enterprise-grade GPU such as the A100 or H100 is recommended.

Attachment: Video Recording of 2*RTX 4090 vLLM Benchmark

2*RTX 4090 vLLM Benchmark: 50 Concurrent Requests Test Hugging Face LLMs
2*RTX 4090 vLLM Benchmark: 100 Concurrent Requests Test Hugging Face LLMs
Screenshot: 2*RTX 4090 vLLM benchmark with 50 Concurrent Requests (Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-14B-Instruct-1M, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/deepseek-moe-16b-base)
Screenshot: 2*RTX 4090 vLLM benchmark with 100 Concurrent Requests (same six models as above)

Data Item Explanation in the Table:

  • Quantization: The number of quantization bits. This test uses 16-bit, i.e., full-precision, unquantized models.
  • Size(GB): Model size in GB.
  • Backend: The inference backend used. In this test, vLLM is used.
  • Request Numbers: The number of successfully processed requests.
  • Benchmark duration(s): The total time to complete all requests.
  • Total input tokens: The total number of input tokens across all requests.
  • Total generated tokens: The total number of output tokens generated across all requests.
  • Request (req/s): The number of requests processed per second.
  • Input (tokens/s): The number of input tokens processed per second.
  • Output (tokens/s): The number of output tokens generated per second.
  • Total Throughput (tokens/s): The total number of tokens processed per second (input + output).
  • Median TTFT(ms): The time from when the request is made to when the first token is received, in milliseconds. A lower TTFT means that the user is able to get a response faster.
  • P99 TTFT (ms): The 99th percentile Time to First Token, representing the worst-case latency for 99% of requests—lower is better to ensure consistent performance.
  • Median TPOT(ms): The time required to generate each output token, in milliseconds. A lower TPOT indicates that the system is able to generate a complete response faster.
  • P99 TPOT (ms): The 99th percentile Time Per Output Token, showing the worst-case delay in token generation—lower is better to minimize response variability.
  • Median Eval Rate(tokens/s): The number of tokens evaluated per second per user. A higher evaluation rate indicates that the system serves each user efficiently.
  • P99 Eval Rate(tokens/s): The tokens evaluated per second for the 99th-percentile user, representing the worst-case user experience (see the computation sketch after this list).
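
To make the derived columns concrete, the sketch below computes median/P99 TTFT, TPOT, and eval rate from hypothetical per-request measurements. The input data and exact formulas (e.g., whether TTFT is excluded when computing TPOT) are assumptions for illustration; the benchmark tool's definitions may differ slightly.

```python
# Illustrative only: how the table's derived columns relate to raw
# per-request measurements. The three tuples below are hypothetical data,
# not output from this benchmark.
import statistics


def p99(values):
    ordered = sorted(values)
    return ordered[int(0.99 * (len(ordered) - 1))]


# (ttft_ms, output_tokens, generation_time_s) per request -- hypothetical
requests = [(180.0, 600, 11.0), (175.2, 600, 10.8), (199.0, 600, 11.1)]

ttfts = [r[0] for r in requests]
tpots = [1000 * r[2] / r[1] for r in requests]   # ms per output token
eval_rates = [r[1] / r[2] for r in requests]     # tokens/s per user (= 1000 / TPOT)

print("Median TTFT (ms):", statistics.median(ttfts))
print("P99 TTFT (ms):   ", p99(ttfts))
print("Median TPOT (ms):", round(statistics.median(tpots), 2))
print("P99 TPOT (ms):   ", round(p99(tpots), 2))
print("Median Eval Rate (tokens/s):", round(statistics.median(eval_rates), 2))
print("P99 Eval Rate (tokens/s):   ", round(1000 / p99(tpots), 2))  # worst-case user
```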
Tags:

tensor-parallel-size: 2, vllm performance tuning for multi-card parallelism, what gpu to use for 14-16b models, vllm benchmark, dual RTX4090 benchmark, 2RTX4090 benchmark results, dual RTX4090 test, vllm 2RTX4090, setting multi-card parallel reasoning llm, vllm server rental, 2*RTX4090 vs A100, vLLM production deployment, LLM inference optimization, cloud GPU rental