RTX4090 vLLM Benchmark: Best GPU for LLMs Below 8B on Hugging Face

As large language models (LLMs) continue to evolve, efficient GPU inference becomes a critical concern. Many AI developers and researchers are looking for an optimal balance between performance, cost, and efficiency when running LLMs on consumer-grade GPUs like the Nvidia RTX 4090.

In this benchmark, we tested various Hugging Face models on an RTX 4090 using vLLM, a high-performance inference framework designed to optimize LLM serving. The results provide valuable insights into vLLM performance, 4090 LLM inference speed, and the best LLM models for consumer GPUs.

Test Overview

1. RTX4090 GPU Details:

  • GPU: NVIDIA RTX 4090
  • Microarchitecture: Ada Lovelace
  • Compute capability: 8.9
  • CUDA Cores: 16384
  • Tensor Cores: 512
  • Memory: 24GB GDDR6X
  • FP32 performance: 82.6 TFLOPS
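
As a quick sanity check, these specs can be verified from inside the test environment. The short sketch below assumes PyTorch with CUDA support is installed and simply prints the compute capability, SM count, and memory, which should match the figures above.

```python
# Minimal sketch: verify the RTX 4090's reported specs from Python.
# Assumes PyTorch with CUDA support; values should match the list above.
import torch

props = torch.cuda.get_device_properties(0)
print(f"Name:               {props.name}")
print(f"Compute capability: {props.major}.{props.minor}")    # expect 8.9 (Ada Lovelace)
print(f"SM count:           {props.multi_processor_count}")  # 128 SMs x 128 CUDA cores = 16,384
print(f"Total memory:       {props.total_memory / 1024**3:.1f} GiB")  # ~24 GB GDDR6X
```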

2. Test Project Code Source:

3. The Following Models from Hugging Face were Tested:

  • Qwen/Qwen2.5-3B-Instruct
  • Qwen/Qwen2.5-VL-7B-Instruct
  • google/gemma-2-9b-it
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  • deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

4. The Test Parameters are Preset as Follows:

  • Input length: 100 tokens
  • Output length: 600 tokens
  • Request Numbers: 300

5. The Test is Divided into Two Modes:

  • Offline testing: All requests are submitted at once, so the engine must batch and process a large number of requests concurrently, which creates resource contention and memory pressure. This mode is best for measuring the system's maximum throughput and peak performance (a minimal reproduction sketch follows this list).
  • Online testing: Requests arrive gradually, simulating real client traffic against a running server, so the system can allocate resources more flexibly and contention is lower. This mode better reflects performance in an actual production environment.
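
For reference, the offline mode can be approximated with vLLM's Python API. The sketch below is a minimal reproduction under the stated presets (100-token inputs, 600-token outputs, 300 requests); the model name is one of those tested, while the prompt construction and timing are simplifications rather than the original benchmark script.

```python
# Minimal offline-throughput sketch using vLLM's Python API.
# Assumes vLLM is installed; the prompt is a placeholder, not the original benchmark's data.
import time
from vllm import LLM, SamplingParams

NUM_REQUESTS = 300                       # preset request count
prompt = "benchmark " * 100              # rough stand-in for a ~100-token input
prompts = [prompt] * NUM_REQUESTS

# Force 600 output tokens per request so every request matches the preset output length.
sampling = SamplingParams(max_tokens=600, ignore_eos=True)

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", gpu_memory_utilization=0.90)

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Requests/s:      {NUM_REQUESTS / elapsed:.2f}")
print(f"Output tokens/s: {out_tokens / elapsed:.2f}")
```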

RTX 4090 Offline Benchmark Results

| Models | Qwen2.5-3B-Instruct | Qwen2.5-VL-7B-Instruct | gemma-2-9b-it | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B |
|---|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 6.2 | 16.6 | 18.5 | 3.6 | 16.1 | 15.2 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| CPU Rate | 1.4% | 1.4% | 1.4% | 1.4% | 1.4% | 1.4% |
| RAM Rate | 3.5% | 3.9% | 3.9% | 4.5% | 3.3% | 4.6% |
| GPU vRAM Rate | 89% | 84% | 84% | 90% | 91% | 91% |
| GPU UTL | 64% | 94% | 94% | 64% | 81-89% | 85% |
| Request (req/s) | 10.31 | 5.79 | 0.68 | 13.85 | 3.96 | 5.96 |
| Total Duration | 28s | 51s | 7min 23s | 21s | 1min 15s | 50s |
| Input (tokens/s) | 1030.59 | 579.34 | 67.59 | 1385.13 | 395.62 | 596.11 |
| Output (tokens/s) | 6183.51 | 3476.04 | 405.54 | 8310.77 | 2373.68 | 3576.69 |
| Total Throughput (tokens/s) | 7214.10 | 4055.38 | 473.13 | 9695.90 | 2769.30 | 4172.80 |

✅ Key Takeaways:

  • The RTX 4090 is best suited for LLMs under 8B. Performance degrades sharply for larger models such as gemma-2-9b-it, most likely because its 18.5GB of weights leave little VRAM for the KV cache.
  • Qwen2.5-3B-Instruct and DeepSeek-R1-Distill-Qwen-1.5B achieved the highest throughput.
  • DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B also performed well, though slower than the smaller models.

RTX 4090 Online Benchmark Results

| Models | Qwen2.5-3B-Instruct | Qwen2.5-VL-7B-Instruct | gemma-2-9b-it | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B |
|---|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 6.2 | 16.6 | 18.5 | 3.6 | 16.1 | 15.2 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| CPU Rate | 2.8% | 2.5% | 1.6% | 3.0% | 2.5% | 2.6% |
| RAM Rate | 3.9% | 5.2% | 4.3% | 4.8% | 3.7% | 4.9% |
| GPU vRAM Rate | 89% | 94% | 84% | 89% | 90% | 90% |
| GPU UTL | 80% | 88%-92% | 94% | 46%-78% | 82%-93% | 75%-100% |
| Request (req/s) | 10.59 | 7.54 | 13.58 | 12.85 | 4.14 | 6.21 |
| Total Duration | 28s | 39s | 22s | 23s | 1min 12s | 48s |
| Input (tokens/s) | 1058.64 | 753.84 | 1357.62 | 1285.32 | 414.28 | 620.5 |
| Output (tokens/s) | 5921.67 | 3255.45 | 305.51 | 6980.81 | 2285.44 | 3344.91 |
| Total Throughput (tokens/s) | 6980.31 | 4009.29 | 1663.13 | 8266.13 | 2699.72 | 3965.41 |

✅ Key Takeaways:

  • The RTX 4090 is again best suited for LLMs under 8B; performance degrades significantly for larger models such as gemma-2-9b-it, likely due to VRAM limitations.
  • In online mode, vLLM adjusts batching and scheduling dynamically as requests arrive.
  • Total throughput is lower than in the offline tests because of this automatic adjustment.
  • Latency improves for most models, but the output token rate falls short of the offline figures (a minimal client sketch for this mode follows below).
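
For completeness, here is a minimal sketch of how the online mode can be exercised. It assumes a vLLM OpenAI-compatible server has already been started (for example with `vllm serve Qwen/Qwen2.5-3B-Instruct --port 8000`) and uses the standard OpenAI Python client; the concurrency level and port are illustrative, not the benchmark's actual settings.

```python
# Minimal online-mode sketch: send concurrent requests to a running vLLM
# OpenAI-compatible server (started separately with `vllm serve ... --port 8000`).
# Assumes the `openai` Python package; the worker count is illustrative only.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
prompt = "benchmark " * 100  # rough stand-in for a ~100-token input

def one_request(_):
    return client.completions.create(
        model="Qwen/Qwen2.5-3B-Instruct",
        prompt=prompt,
        max_tokens=600,
    )

# 300 requests, as in the preset; 32 workers roughly approximates gradually arriving traffic.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(one_request, range(300)))

print(f"Completed {len(results)} requests")
```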

Data Item Explanation in the Table:

  • Quantization: The number of quantization bits. This test uses 16-bit (FP16/BF16) weights, i.e., the full, un-quantized model.
  • Size (GB): Model size in GB.
  • Backend: The inference backend used. In this test, vLLM is used.
  • CPU Rate: CPU usage rate.
  • RAM Rate: Memory usage rate.
  • GPU vRAM: GPU memory usage.
  • GPU UTL: GPU utilization.
  • Request (req/s): The number of requests processed per second.
  • Total Duration: The total time to complete all requests.
  • Input (tokens/s): The number of input tokens processed per second.
  • Output (tokens/s): The number of output tokens generated per second.
  • Total Throughput (tokens/s): The total number of tokens processed per second (input + output).
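
As a worked example of how these figures relate, take the offline Qwen2.5-3B-Instruct column: 300 requests of 100 input and 600 output tokens completed in about 28 seconds. The quick check below reconstructs the reported rates; the small gap comes from the duration being rounded to whole seconds in the table.

```python
# Sanity-check of the offline Qwen2.5-3B-Instruct column from the table above.
num_requests = 300
input_len, output_len = 100, 600
duration_s = 28  # reported total duration, rounded to whole seconds

req_per_s  = num_requests / duration_s               # ~10.7  (reported: 10.31)
input_tps  = num_requests * input_len / duration_s   # ~1071  (reported: 1030.59)
output_tps = num_requests * output_len / duration_s  # ~6429  (reported: 6183.51)
total_tps  = input_tps + output_tps                  # ~7500  (reported: 7214.10)
print(req_per_s, input_tps, output_tps, total_tps)
```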

Performance Analysis: Why Does Online Inference Differ?

The RTX 4090 vLLM online benchmark reveals that vLLM might prioritize latency over total output. Here’s why:

1. VRAM Bottlenecks on Larger Models

  • gemma-2-9b-it’s low throughput suggests 24GB of VRAM is insufficient: its 18.5GB of weights leave little room for the KV cache, so far fewer requests can be batched together.
  • DeepSeek-R1-Distill-Llama-8B and Qwen-7B, at roughly 16GB of weights each, suffer from the same pressure, though less severely.

2. vLLM Optimization Adjusts Output Dynamically

  • Online testing reduces output tokens if GPU utilization is too high.
  • This ensures faster response times at the cost of lower throughput.
  • Models that used close to 100% GPU VRAM in offline mode saw adjustments in online tests.

3. Trade-offs Between Latency and Throughput

  • DeepSeek-R1-Distill-Qwen-1.5B saw minimal latency impact, since its small footprint leaves ample VRAM for batching.
  • Larger models struggled, likely due to how vLLM manages token output under memory pressure (see the tuning sketch below).
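
When a model sits close to the 24GB limit, a few vLLM engine parameters govern this latency/throughput trade-off. The sketch below shows the knobs most relevant to the behavior described above; the values are illustrative starting points, not the settings used in this benchmark.

```python
# Illustrative vLLM engine settings for a VRAM-constrained card such as the RTX 4090.
# Values are starting points for experimentation, not this benchmark's actual configuration.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim for weights + KV cache
    max_model_len=2048,           # shorter context window -> smaller per-request KV cache
    max_num_seqs=64,              # cap on how many requests are batched concurrently
)
```

Lowering max_model_len or max_num_seqs eases KV-cache pressure at the cost of batch size, which is essentially the trade-off visible in the online results.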

Best LLMs for RTX 4090: Recommendations

Based on these vLLM benchmark results, the RTX 4090 is best suited for models under 8B parameters.

  • Best Performance: DeepSeek-R1-Distill-1.5B
  • Best Balance of Speed & Output: Qwen2.5-3B, DeepSeek-R1-Distill-Qwen-7B
  • Avoid for Large Models: Gemma-9B struggles due to VRAM limits
For developers looking to deploy LLMs on RTX 4090, choosing a model under 8B ensures stable performance without vLLM’s auto-adjustments impacting output tokens.
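
A rough back-of-the-envelope check explains the 8B cut-off: at 16-bit precision each parameter takes 2 bytes, so an 8B model's weights alone occupy about 16GB, leaving only a few GB of the 4090's 24GB for the KV cache, activations, and CUDA overhead. The short calculation below makes this concrete; the 2GB overhead figure is an assumption, and actual model files (see the Size row in the tables) run slightly larger than the bare estimate.

```python
# Back-of-the-envelope VRAM budget for FP16/BF16 models on a 24GB RTX 4090.
# Weight footprint = parameters * 2 bytes; the overhead allowance is an assumption.
VRAM_GB = 24
OVERHEAD_GB = 2  # rough allowance for activations, CUDA context, fragmentation

for name, params_b in [("1.5B", 1.5), ("3B", 3.0), ("7B", 7.0), ("8B", 8.0), ("9B", 9.0)]:
    weights_gb = params_b * 2  # 2 bytes per parameter at 16-bit precision
    kv_budget = VRAM_GB - OVERHEAD_GB - weights_gb
    print(f"{name}: weights ~{weights_gb:.0f} GB, ~{kv_budget:.0f} GB left for the KV cache")
```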

Get Started with RTX 4090 Server Rental!

Interested in optimizing your LLM deployment? Check out vLLM server rental services or explore alternative GPUs for high-end AI inference.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% OFF Recurring (Was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Enterprise GPU Dedicated Server - A100(80GB)

$1559.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS

Enterprise GPU Dedicated Server - H100

$2099.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia H100
  • Microarchitecture: Hopper
  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 183 TFLOPS

Conclusion: Is RTX 4090 the Best Consumer GPU for vLLM?

If you're running Hugging Face LLMs on RTX 4090, vLLM optimizations make it a great choice for smaller models. However, for models 8B and larger, a higher VRAM GPU (e.g., A100, H100) is recommended.

This benchmark highlights RTX 4090's strengths in handling efficient LLM inference, making it ideal for developers, AI researchers, and vLLM server rentals.

Attachment: Video Recording of RTX4090 vLLM Benchmark

Video: RTX4090 vLLM Benchmark Offline Test Hugging Face LLMs
Video: RTX4090 vLLM Benchmark Online Test Hugging Face LLMs
Screenshot: RTX4090 vLLM benchmark offline results (one capture per model tested)
Screenshot: RTX4090 vLLM benchmark online results (one capture per model tested)
Tags:

RTX4090 Server Rental, vLLM benchmark, RTX 4090 LLM test, best LLM GPU, Hugging Face vLLM, AI inference RTX 4090, LLM GPU comparison, vLLM server rental, best GPU for AI inference, vLLM vs HF Transformers, DeepSeek LLM benchmark, AI performance tuning.