Models | Qwen2.5-3B-Instruct | Qwen2.5-VL-7B-Instruct | gemma-2-9b-it | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B |
---|---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 6.2 | 16.6 | 18.5 | 3.6 | 16.1 | 15.2 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
CPU Utilization | 1.4% | 1.4% | 1.4% | 1.4% | 1.4% | 1.4% |
RAM Utilization | 3.5% | 3.9% | 3.9% | 4.5% | 3.3% | 4.6% |
GPU VRAM Utilization | 89% | 84% | 84% | 90% | 91% | 91% |
GPU Utilization | 64% | 94% | 94% | 64% | 81%-89% | 85% |
Request Throughput (req/s) | 10.31 | 5.79 | 0.68 | 13.85 | 3.96 | 5.96 |
Total Duration | 28s | 51s | 7min23s | 21s | 1min15s | 50s |
Input (tokens/s) | 1030.59 | 579.34 | 67.59 | 1385.13 | 395.62 | 596.11 |
Output (tokens/s) | 6183.51 | 3476.04 | 405.54 | 8310.77 | 2373.68 | 3576.69 |
Total Throughput (tokens/s) | 7214.10 | 4055.38 | 473.13 | 9695.90 | 2769.30 | 4172.80 |
Models | Qwen2.5-3B-Instruct | Qwen2.5-VL-7B-Instruct | gemma-2-9b-it | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-7B |
---|---|---|---|---|---|---|
Quantization (bits) | 16 | 16 | 16 | 16 | 16 | 16 |
Size (GB) | 6.2 | 16.6 | 18.5 | 3.6 | 16.1 | 15.2 |
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
CPU Utilization | 2.8% | 2.5% | 1.6% | 3.0% | 2.5% | 2.6% |
RAM Utilization | 3.9% | 5.2% | 4.3% | 4.8% | 3.7% | 4.9% |
GPU VRAM Utilization | 89% | 94% | 84% | 89% | 90% | 90% |
GPU Utilization | 80% | 88%-92% | 94% | 46%-78% | 82%-93% | 75%-100% |
Request Throughput (req/s) | 10.59 | 7.54 | 13.58 | 12.85 | 4.14 | 6.21 |
Total Duration | 28s | 39s | 22s | 23s | 1min12s | 48s |
Input (tokens/s) | 1058.64 | 753.84 | 1357.62 | 1285.32 | 414.28 | 620.5 |
Output (tokens/s) | 5921.67 | 3255.45 | 305.51 | 6980.81 | 2285.44 | 3344.91 |
Total Throughput (tokens/s) | 6980.31 | 4009.29 | 1663.13 | 8266.13 | 2699.72 | 3965.41 |
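If you want to reproduce numbers like these on your own hardware, a short offline script against vLLM's Python API is enough to collect the same metrics (requests/s, input, output, and total tokens/s). The sketch below is illustrative only: the model name, prompt set, prompt count, and sampling settings are assumptions, not the exact configuration behind the tables above.

```python
import time
from vllm import LLM, SamplingParams

# Assumed settings -- swap in any Hugging Face model that fits in 24 GB VRAM.
MODEL = "Qwen/Qwen2.5-3B-Instruct"
prompts = ["Explain the difference between CPU and GPU inference."] * 200

llm = LLM(model=MODEL, dtype="float16", gpu_memory_utilization=0.9)
params = SamplingParams(temperature=0.7, max_tokens=512)

start = time.perf_counter()
outputs = llm.generate(prompts, params)   # batched offline generation
elapsed = time.perf_counter() - start

# Count prompt and completion tokens across all requests.
in_tokens = sum(len(o.prompt_token_ids) for o in outputs)
out_tokens = sum(len(c.token_ids) for o in outputs for c in o.outputs)

print(f"Requests: {len(outputs) / elapsed:.2f} req/s")
print(f"Input:    {in_tokens / elapsed:.2f} tokens/s")
print(f"Output:   {out_tokens / elapsed:.2f} tokens/s")
print(f"Total:    {(in_tokens + out_tokens) / elapsed:.2f} tokens/s")
```

For a more complete harness, the vLLM repository also ships dedicated benchmarking scripts (e.g., benchmarks/benchmark_throughput.py), which handle dataset sampling and warm-up more carefully than this sketch.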
Interested in optimizing your LLM deployment? Check out vLLM server rental services or explore alternative GPUs for high-end AI inference:

- Enterprise GPU Dedicated Server - RTX 4090
- Enterprise GPU Dedicated Server - A100
- Enterprise GPU Dedicated Server - A100 (80GB)
- Enterprise GPU Dedicated Server - H100
If you're running Hugging Face LLMs on an RTX 4090, vLLM's optimizations make it a great choice for smaller models. For models of 8B parameters and larger, however, a GPU with more VRAM (e.g., A100, H100) is recommended.
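If you do need to squeeze a larger model into the RTX 4090's 24 GB before moving to a bigger card, vLLM exposes a few knobs worth trying. The snippet below is a minimal sketch, assuming an AWQ-quantized checkpoint is available for the model you want; the checkpoint names are examples, not the models tested above.

```python
from vllm import LLM

# Option 1: load a quantized (AWQ) checkpoint instead of full FP16 weights,
# and cap the context length to leave headroom for the KV cache.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # assumed AWQ checkpoint from the Hub
    quantization="awq",
    max_model_len=4096,                    # shorter context = smaller KV cache
    gpu_memory_utilization=0.90,           # fraction of VRAM vLLM may claim
)

# Option 2 (multi-GPU servers): shard FP16 weights across cards instead.
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
```

Quantization trades some accuracy for memory; tensor parallelism keeps full precision but requires a multi-GPU server, which is where the A100/H100 plans above come in.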
This benchmark highlights the RTX 4090's strength in efficient LLM inference, making it a solid option for developers, AI researchers, and vLLM server rental deployments.