3×V100 vLLM Benchmark: Multi-GPU Inference Performance and Optimization

In the evolving landscape of LLM inference, balancing GPU utilization, model size, and inference speed is critical for efficiency. This report presents a benchmark of large language model (LLM) inference on 3×V100 GPUs using vLLM, a high-throughput and memory-efficient inference engine, evaluating six models under 50 and 100 concurrent requests. By leveraging tensor-parallel and pipeline-parallel strategies, we tested various configurations to optimize performance while maintaining compatibility with models up to 14B parameters.

Test Overview

1. Single V100 GPU Details:

  • GPU: Nvidia V100
  • Microarchitecture: Volta
  • Compute capability: 7.0
  • CUDA Cores: 5120
  • Tensor Cores: 640
  • Memory: 16GB HBM2
  • FP32 performance: 14 TFLOPS

2. Test Project Code Source:

3. The Following Models from Hugging Face were Tested:

  • google/gemma-3-4b-it
  • google/gemma-3-12b-it
  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/Qwen2.5-14B-Instruct
  • deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

4. The Online Test Parameters are Preset as Follows:

  • Input length: 100 tokens
  • Output length: 600 tokens
  • 50 / 100 concurrent requests (a reproduction sketch follows this list)
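
The report does not include the exact client command. As a rough reproduction sketch, vLLM's bundled benchmark_serving.py script can generate this workload in its random-dataset mode; the flag names below are an assumption and vary between vLLM releases, so check the script's --help before running it.

```bash
# Sketch only: assumes a vLLM OpenAI-compatible server is already listening on
# localhost:8000 and that this vLLM release supports the flags shown below.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 100 \
  --random-output-len 600 \
  --num-prompts 50 \
  --max-concurrency 50   # set both counts to 100 for the second scenario
```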

5. Optimizing vLLM Multi-GPU Inference

  • Tensor Parallelism: Provides faster inference by splitting each layer's weights across GPUs, but the partition count must be an even, supported value (e.g., 2, 4, 8). On this server that means TP=2, which is faster but limited to the memory of two GPUs.
  • Pipeline Parallelism: Distributes whole layers as sequential stages across GPUs, allowing larger models to run by using all three cards (PP=3), at the cost of some efficiency.
  • Float16 Precision: The V100 supports float16 but not bfloat16, which Gemma 3 uses by default. Manually setting --dtype float16 lets the V100 run these newer models and resolves the compatibility issue.

6. Given these constraints, we adjusted parameters dynamically:

  • Tensor Parallel (TP=2) was used wherever possible for faster performance.
  • Pipeline Parallel (PP=3) was utilized for models too large to fit under TP=2 (example launch commands are shown below).
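
As a concrete illustration of these settings, the two launch modes look roughly like this with vLLM's OpenAI-compatible server (a sketch rather than the exact commands used in the test; ports and other serving options are omitted):

```bash
# TP=2: split each layer's weights across 2 of the 3 GPUs (fastest when the model fits).
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --tensor-parallel-size 2 \
  --dtype float16

# PP=3: spread the layers over all 3 GPUs as sequential stages (needed for the 14B models).
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --pipeline-parallel-size 3 \
  --dtype float16
```

The --dtype float16 override matters most for Gemma 3, whose default bfloat16 weights will not run on Volta GPUs.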

3×V100 Benchmark for Scenario 1: 50 Concurrent Requests

Models | gemma-3-4b-it | gemma-3-12b-it | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B
Quantization | 16 | 16 | 16 | 16 | 16 | 16
Size (GB) | 8.1 | 23 | 15 | 28 | 15 | 28
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM
Tensor Parallelism | 2 | 1 | 2 | 1 | 2 | 1
Pipeline Parallelism | 1 | 3 | 1 | 3 | 1 | 3
Data Type | float16 | float16 | float16 | float16 | float16 | float16
Request Numbers | 50 | 50 | 50 | 50 | 50 | 50
Benchmark Duration (s) | 18.94 | 77.88 | 21.35 | 51.51 | 22.76 | 46.56
Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000
Total Generated Tokens | 30000 | 30000 | 24552 | 25370 | 27980 | 17336
Request (req/s) | 2.64 | 0.64 | 2.34 | 0.97 | 2.2 | 1.07
Input (tokens/s) | 263.96 | 64.2 | 234.16 | 97.06 | 219.69 | 107.38
Output (tokens/s) | 1583.76 | 385.19 | 1149.84 | 492.50 | 1229.37 | 372.32
Total Throughput (tokens/s) | 1847.72 | 449.39 | 1384.00 | 589.56 | 1449.06 | 479.70
Median TTFT (ms) | 688.76 | 1156.65 | 800.71 | 1278.92 | 903.35 | 1238.78
P99 TTFT (ms) | 713.25 | 1347.99 | 866.98 | 1500.20 | 1014.31 | 1461.00
Median TPOT (ms) | 30.41 | 96.99 | 34.26 | 83.73 | 36.44 | 76.06
P99 TPOT (ms) | 31.02 | 128.02 | 94.25 | 85.29 | 38.05 | 90.38
Median Eval Rate (tokens/s) | 32.88 | 27.96 | 29.19 | 11.94 | 27.44 | 13.15
P99 Eval Rate (tokens/s) | 32.24 | 7.37 | 10.61 | 11.72 | 26.28 | 11.06

✅ Performance Insights: 50 Concurrent Requests

  • Gemma 3-12B (TP=1, PP=3): Achieved a total throughput of 449.39 tokens/s with a median TTFT of 1156.65 ms (P99: 1347.99 ms).
  • Qwen 7B (TP=2, PP=1): Reached 1,384 tokens/s, significantly outperforming Gemma 3-12B due to its lower parameter count.
  • Qwen 14B (TP=1, PP=3): Handled 589.56 tokens/s, demonstrating how pipeline parallelism sacrifices speed for compatibility.
  • Models using TP=2 (e.g., Gemma3-4B, Qwen2.5-7B) achieve higher throughput (1384-1847 tokens/s) and lower latency compared to PP=3 configurations.

3×V100 Benchmark for Scenario 2: 100 Concurrent Requests

We then increased the number of concurrent requests to 100 to see how the 3×V100 setup handles the heavier load.
Models | gemma-3-4b-it | gemma-3-12b-it | Qwen2.5-7B-Instruct | Qwen2.5-14B-Instruct | DeepSeek-R1-Distill-Llama-8B | DeepSeek-R1-Distill-Qwen-14B
Quantization | 16 | 16 | 16 | 16 | 16 | 16
Size (GB) | 8.1 | 23 | 15 | 28 | 15 | 28
Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM
Tensor Parallelism | 2 | 1 | 2 | 1 | 2 | 1
Pipeline Parallelism | 1 | 3 | 1 | 3 | 1 | 3
Data Type | float16 | float16 | float16 | float16 | float16 | float16
Request Numbers | 100 | 100 | 100 | 100 | 100 | 100
Benchmark Duration (s) | 28.28 | 160.75 | 33.13 | 75.29 | 31.44 | 58.50
Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000
Total Generated Tokens | 60000 | 60000 | 49053 | 50097 | 54680 | 37976
Request (req/s) | 3.54 | 0.62 | 3.02 | 1.33 | 3.18 | 1.71
Input (tokens/s) | 353.57 | 62.21 | 301.84 | 132.82 | 318.04 | 170.93
Output (tokens/s) | 2121.41 | 373.25 | 1480.62 | 665.36 | 1739.07 | 649.12
Total Throughput (tokens/s) | 2474.98 | 435.46 | 1782.46 | 798.18 | 2057.11 | 820.05
Median TTFT (ms) | 1166.77 | 1897.38 | 1402.62 | 2224.29 | 1426.77 | 2174.83
P99 TTFT (ms) | 1390.66 | 2276.37 | 1692.47 | 2708.26 | 1958.23 | 2649.59
Median TPOT (ms) | 45.16 | 148.57 | 52.86 | 107.22 | 49.66 | 94.56
P99 TPOT (ms) | 45.58 | 261.89 | 377.16 | 120.11 | 59.43 | 156.27
Median Eval Rate (tokens/s) | 22.14 | 6.73 | 18.92 | 9.33 | 20.13 | 10.57
P99 Eval Rate (tokens/s) | 22.04 | 3.82 | 2.65 | 8.33 | 16.83 | 6.40

✅ Performance Insights: 100 Concurrent Requests

  • As concurrency increased, median TTFT and TPOT rose due to queuing delays; total throughput improved thanks to larger batches, but per-user eval rates dropped.
  • Gemma 3-12B (TP=1, PP=3) dropped to 435.46 tokens/s, highlighting that this newer model is also more GPU-demanding.
  • DeepSeek-R1-Distill-Llama-8B (TP=2, PP=1) reached 2057 tokens/s, showing that tensor parallelism sustains performance at higher concurrency.

Key Findings of the 3×V100 Benchmark Results

1. Tensor-parallel (TP) outperforms pipeline-parallel (PP)

Models using TP=2 (e.g., Gemma 3-4B, Qwen2.5-7B, Llama-8B) achieve higher throughput (1782–2474 tokens/s at 100 requests) and lower latency compared to PP=3 configurations.

2. Pipeline-parallel enables larger models

When TP fails (e.g., for 14B models), PP=3 allows execution but with reduced speed (e.g., Qwen2.5-14B: 798 tokens/s at 100 requests).

3. Gemma3-4B leads in efficiency

With TP=2, it achieves 2.64 req/s (50 reqs) and 3.54 req/s (100 reqs), the highest among tested models.

Optimization Insights and Best Practices for vLLM Performance

✅ Use Tensor Parallelism (TP=2) Whenever Possible:

This provides the best throughput. However, models exceeding VRAM limits require Pipeline Parallelism (PP=3).

✅ Make sure the tensor-parallel size divides the model evenly:

In practice TP cannot be 3 for these models (vLLM requires the attention-head count to be divisible by the tensor-parallel size), so deployments must use TP=2 or TP=4 rather than 3 or 5; when all three GPUs are needed, fall back to PP=3.

✅ Set --dtype float16 for Newer Models:

Models such as Gemma 3 default to bfloat16, which Volta GPUs do not support, so this override is required for them to run on V100s at all.

✅ Balance Concurrency and Latency:

Limiting concurrency to 50 instead of 100 noticeably improves per-user generation speed and latency (one way to enforce such a cap is sketched below).
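
If you want to enforce such a cap on the server rather than in the client, one option (an assumption about your deployment, not something done in this benchmark) is vLLM's --max-num-seqs engine argument, which limits how many sequences are batched at once:

```bash
# Sketch: cap concurrent sequences at 50 so latency stays close to the
# 50-request figures even when clients submit more requests at once.
vllm serve google/gemma-3-4b-it \
  --tensor-parallel-size 2 \
  --dtype float16 \
  --max-num-seqs 50
```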

Get Started with 3×V100 GPU Server Hosting

Interested in optimizing your vLLM deployment? Check out GPU server rental services or explore alternative GPUs for high-end AI inference.

Multi-GPU Dedicated Server - 3xV100

$469.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 3 x Nvidia V100
  • Microarchitecture: Volta
  • CUDA Cores: 5,120
  • Tensor Cores: 640
  • GPU Memory: 16GB HBM2
  • FP32 Performance: 14 TFLOPS
  • Well-suited to deep learning and AI workloads thanks to its Tensor Cores

Multi-GPU Dedicated Server - 2xRTX 4090

$449.50/mo
50% OFF Recurring (Was $899.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS

Enterprise GPU Dedicated Server - A100

$469.00/mo
41% OFF Recurring (Was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • A good alternative to the A800, H100, H800, and L40. Supports FP64 precision computation for large-scale inference, AI training, and ML workloads

Multi-GPU Dedicated Server - 2xA100

$1099.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Free NVLink Included
  • A powerful dual-GPU solution for demanding AI workloads, large-scale inference, and ML training. A cost-effective alternative to the A100 80GB and H100, delivering exceptional performance at a competitive price.

Conclusion: 3×V100 Is a Budget-Friendly Choice for LLMs up to 14B on Hugging Face

This 3×V100 vLLM benchmark demonstrates that carefully tuning tensor-parallel and pipeline-parallel parameters can maximize model performance while extending GPU capability to larger models. By following these best practices, researchers and developers can optimize LLM inference on budget-friendly multi-GPU setups.

Attachment: Video Recording of the 3×V100 vLLM Benchmark

Screenshot: 3×V100 GPU vLLM Benchmark for 50 Requests (one capture per model: google/gemma-3-4b-it, google/gemma-3-12b-it, Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-14B-Instruct, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)
Screenshot: 3×V100 GPU vLLM Benchmark for 100 Requests (same set of models)

Data Item Explanation in the Table:

  • Quantization: The number of quantization bits. This test uses 16-bit precision, i.e., full, unquantized models.
  • Size(GB): Model size in GB.
  • Backend: The inference backend used. In this test, vLLM is used.
  • --tensor-parallel-size: Specifies the number of GPUs used to split each layer's weight tensors (intra-layer parallelism), reducing computation time but requiring fast GPU-to-GPU communication and high memory bandwidth (e.g., --tensor-parallel-size 2 splits every layer across 2 GPUs).
  • --pipeline-parallel-size: Distributes model layers vertically (stage-wise) across GPUs, enabling larger models to run with higher memory efficiency but introducing communication overhead (e.g., --pipeline-parallel-size 3 divides layers sequentially across 3 GPUs).
  • --dtype: Defines the numerical precision (e.g., --dtype float16 for FP16) to balance memory usage and computational accuracy during inference. Lower precision (e.g., FP16) speeds up inference but may slightly reduce output quality.
  • Successful Requests: The number of requests processed.
  • Benchmark duration(s): The total time to complete all requests.
  • Total input tokens: The total number of input tokens across all requests.
  • Total generated tokens: The total number of output tokens generated across all requests.
  • Request (req/s): The number of requests processed per second.
  • Input (tokens/s): The number of input tokens processed per second.
  • Output (tokens/s): The number of output tokens generated per second.
  • Total Throughput (tokens/s): The total number of tokens processed per second (input + output).
  • Median TTFT(ms): The time from when the request is made to when the first token is received, in milliseconds. A lower TTFT means that the user is able to get a response faster.
  • P99 TTFT (ms): The 99th percentile Time to First Token, representing the worst-case latency for 99% of requests—lower is better to ensure consistent performance.
  • Median TPOT(ms): The time required to generate each output token, in milliseconds. A lower TPOT indicates that the system is able to generate a complete response faster.
  • P99 TPOT (ms): The 99th percentile Time Per Output Token, showing the worst-case delay in token generation—lower is better to minimize response variability.
  • Median Eval Rate(tokens/s): The number of tokens evaluated per second per user. A high evaluation rate indicates that the system is able to serve each user efficiently.
  • P99 Eval Rate(tokens/s): The per-user evaluation rate at the 99th percentile, representing the worst-case user experience. A short worked example of how these metrics relate follows this list.
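
As a sanity check on how these metrics relate, take the gemma-3-4b-it column of the 50-request table (small differences from the reported figures come from rounding the benchmark duration):

  • Request (req/s) = 50 requests / 18.94 s ≈ 2.64
  • Input (tokens/s) = 5000 tokens / 18.94 s ≈ 264
  • Output (tokens/s) = 30000 tokens / 18.94 s ≈ 1584
  • Total Throughput ≈ 264 + 1584 ≈ 1848 tokens/s
  • Median Eval Rate ≈ 1000 / Median TPOT = 1000 / 30.41 ms ≈ 32.9 tokens/s
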
Tags:

3×V100 vLLM benchmark, vLLM multi-GPU inference, tensor-parallel, pipeline-parallel, vLLM performance optimization, Gemma 3-12B inference, Qwen 7B benchmark, Llama-8B performance, LLM inference tuning, float16 precision, vLLM concurrency tuning