A40 vLLM Benchmark: 7B to 14B LLMs Under 50 & 100 Concurrent Requests

As large language models (LLMs) become integral to modern AI deployments, efficient hosting with optimized inference latency is key. This report benchmarks the performance of the NVIDIA A40 (48GB) using the vLLM inference engine under 50 and 100 concurrent request conditions, testing various 7B to 14B models, including DeepSeek, Qwen, Meta-LLaMA, and Gemma3.

Test Overview

1. A40 GPU Details (a quick verification snippet follows the list):

  • GPU: Nvidia A40
  • Microarchitecture: Ampere
  • Compute capability: 8.6
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • Memory: 48GB GDDR6
  • FP32 performance: 37.48 TFLOPS
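
On a rented server, these specifications can be double-checked from Python. Below is a minimal sketch, assuming PyTorch with CUDA support is installed; it prints the device name, compute capability, SM count, and total memory reported by the driver.

```python
# Minimal check of the GPU specs listed above (assumes PyTorch with CUDA is installed).
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                                          # expect "NVIDIA A40"
print(f"Compute capability: {props.major}.{props.minor}")  # expect 8.6
print(f"SMs: {props.multi_processor_count}")               # 84 SMs x 128 cores/SM = 10,752 CUDA cores
print(f"Memory: {props.total_memory / 1024**3:.1f} GiB")   # ~48 GB (driver reports slightly less)
```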

2. Test Project Code Source:

3. The Following Models from Hugging Face were Tested:

  • meta-llama/Llama-3.1-8B-Instruct
  • deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
  • Qwen/Qwen2.5-14B-Instruct
  • Qwen/Qwen2.5-7B-Instruct
  • google/gemma-3-12b-it

4. The Test Parameters are Preset as Follows:

  • Input length: 100 tokens
  • Output length: 600 tokens

5. We conducted two rounds of A40 vLLM tests under different concurrent request loads (a minimal client sketch for this kind of load test follows the list):

  • Scenario 1: 50 concurrent requests
  • Scenario 2: 100 concurrent requests
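
For illustration, the sketch below shows how this kind of concurrent load test can be driven against a vLLM server; it is not the exact script behind the tables that follow. It assumes a vLLM OpenAI-compatible server is already running locally (for example, launched with vllm serve for one of the models above) and that aiohttp is installed; the URL, model name, and prompt are placeholders. Each request streams 600 output tokens, and the script reports the median TTFT and TPOT across all concurrent requests.

```python
# Minimal load-test sketch, NOT the exact script used for the tables below.
# Assumes: a vLLM OpenAI-compatible server is already running locally, and
# aiohttp is installed. URL, model, and prompt are placeholders.
import asyncio
import statistics
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"
PROMPT = "benchmark " * 100         # roughly 100 input tokens
CONCURRENCY = 50                    # 50 or 100, matching the two scenarios

async def one_request(session: aiohttp.ClientSession):
    payload = {
        "model": MODEL,
        "prompt": PROMPT,
        "max_tokens": 600,          # fixed output length, as in the test setup
        "ignore_eos": True,         # vLLM-specific option: always generate max_tokens
        "stream": True,
    }
    start = time.perf_counter()
    ttft, chunks = None, 0
    async with session.post(URL, json=payload) as resp:
        async for raw in resp.content:              # one SSE "data: ..." line per chunk
            line = raw.strip()
            if not line or line == b"data: [DONE]":
                continue
            if ttft is None:
                ttft = time.perf_counter() - start  # time to first token
            chunks += 1                             # roughly one token per chunk
    total = time.perf_counter() - start
    if ttft is None:                                # no output received (request failed)
        return total, float("inf")
    tpot = (total - ttft) / max(chunks - 1, 1)      # approximate time per output token
    return ttft, tpot

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [one_request(session) for _ in range(CONCURRENCY)]
        results = await asyncio.gather(*tasks)
    print(f"Median TTFT: {statistics.median(r[0] for r in results) * 1000:.2f} ms")
    print(f"Median TPOT: {statistics.median(r[1] for r in results) * 1000:.2f} ms")

asyncio.run(main())
```

The published tables additionally report percentiles, per-user eval rates, and aggregate throughput, all of which can be derived from the same per-request timings.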

A40 Benchmark for Scenario 1: 50 Concurrent Requests

| Models | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | Qwen/Qwen2.5-14B-Instruct | Qwen/Qwen2.5-7B-Instruct | google/gemma-3-12b-it |
|---|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 15 | 15 | 28 | 28 | 15 | 23 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 50 | 50 | 50 | 50 | 50 | 50 |
| Benchmark Duration (s) | 21.81 | 23.81 | 38.77 | 47.58 | 23.48 | 65.02 |
| Total Input Tokens | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
| Total Generated Tokens | 22170 | 27517 | 17321 | 25213 | 25703 | 27148 |
| Request (req/s) | 2.29 | 2.10 | 1.29 | 1.05 | 2.13 | 0.77 |
| Input (tokens/s) | 229.27 | 210.02 | 128.96 | 105.08 | 212.93 | 76.9 |
| Output (tokens/s) | 1016.60 | 1155.80 | 446.73 | 529.88 | 1094.54 | 417.54 |
| Total Throughput (tokens/s) | 1245.87 | 1365.82 | 575.69 | 634.96 | 1307.47 | 494.44 |
| Median TTFT (ms) | 219.88 | 756.72 | 1153.98 | 1434.29 | 761.11 | 401.22 |
| P99 TTFT (ms) | 242.82 | 940.90 | 1486.31 | 1847.22 | 983.28 | 573.97 |
| Median TPOT (ms) | 35.96 | 38.41 | 63.45 | 76.88 | 37.82 | 107.74 |
| P99 TPOT (ms) | 36.12 | 41.08 | 95.96 | 80.39 | 190.41 | 115.72 |
| Median Eval Rate (tokens/s) | 27.81 | 26.03 | 15.76 | 13.01 | 26.44 | 9.28 |
| P99 Eval Rate (tokens/s) | 27.69 | 24.34 | 10.42 | 12.44 | 5.25 | 8.64 |

✅ Key Takeaways:

  • 7B-8B models (Llama-3.1-8B, DeepSeek-R1-Distill-Llama-8B, Qwen2.5-7B) perform best.
  • 14B models are usable but slower.
  • The Gemma-3-12B and DeepSeek 14B models begin to push the limits of the GPU memory, especially when using longer sequence lengths.
  • At 50 requests, the A40 shows strong baseline performance across models, especially in token generation and throughput. However, it’s clear from later results that the GPU is not fully saturated at this level for most models.

A40 Benchmark for Scenario 2: 100 Concurrent Requests

| Models | meta-llama/Llama-3.1-8B-Instruct | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | Qwen/Qwen2.5-14B-Instruct | Qwen/Qwen2.5-7B-Instruct | google/gemma-3-12b-it |
|---|---|---|---|---|---|---|
| Quantization | 16 | 16 | 16 | 16 | 16 | 16 |
| Size (GB) | 15 | 15 | 28 | 28 | 15 | 23 |
| Backend/Platform | vLLM | vLLM | vLLM | vLLM | vLLM | vLLM |
| Request Numbers | 100 | 100 | 100 | 100 | 100 | 100 |
| Benchmark Duration (s) | 26.61 | 40.83 | 45.25 | 87.61 | 42.94 | 123.86 |
| Total Input Tokens | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| Total Generated Tokens | 45375 | 54430 | 37135 | 51177 | 48267 | 50548 |
| Request (req/s) | 3.76 | 2.45 | 2.21 | 1.14 | 2.33 | 0.81 |
| Input (tokens/s) | 375.86 | 244.94 | 220.99 | 114.14 | 232.87 | 80.74 |
| Output (tokens/s) | 1705.49 | 1333.21 | 820.62 | 584.14 | 1123.99 | 408.12 |
| Total Throughput (tokens/s) | 2081.35 | 1578.15 | 1041.61 | 698.28 | 1356.86 | 488.86 |
| Median TTFT (ms) | 487.81 | 671.18 | 546.93 | 1426.07 | 1093.55 | 1865.75 |
| P99 TTFT (ms) | 1065.56 | 1422.67 | 1838.40 | 5381.54 | 2443.18 | 4022.71 |
| Median TPOT (ms) | 43.43 | 66.86 | 74.32 | 143.52 | 69.68 | 154.06 |
| P99 TPOT (ms) | 64.93 | 67.43 | 118.59 | 166.48 | 793.62 | 1091.31 |
| Median Eval Rate (tokens/s) | 23.03 | 14.86 | 13.45 | 6.97 | 14.35 | 6.49 |
| P99 Eval Rate (tokens/s) | 43.42 | 14.83 | 8.43 | 6.01 | 1.26 | 0.92 |

✅ Key Takeaways:

  • A40 can handle 8B models efficiently even at high concurrency.
  • 14B models (Qwen/DeepSeek) are usable but with higher latency.
  • Gemma-3-12B is not recommended for A40 due to poor scaling.
  • The A40 successfully handled 100 concurrent requests with all tested 14B models, proving that it is viable for real-world LLM inference at scale, especially for models like DeepSeek and Qwen.
  • Doubling to 100 requests produces a clear rise in token throughput for every model except Gemma-3-12B, indicating that the A40 still had headroom at 50 requests; the quick scaling check below illustrates this.
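
As a quick sanity check of that point, the ratio of total throughput at 100 versus 50 requests can be computed directly from the two tables (values copied from above; three representative models shown):

```python
# Total Throughput (tokens/s) taken from the two tables above.
throughput_50  = {"Llama-3.1-8B": 1245.87, "DeepSeek-R1-Distill-Qwen-14B": 575.69, "gemma-3-12b-it": 494.44}
throughput_100 = {"Llama-3.1-8B": 2081.35, "DeepSeek-R1-Distill-Qwen-14B": 1041.61, "gemma-3-12b-it": 488.86}

for name, t50 in throughput_50.items():
    print(f"{name}: x{throughput_100[name] / t50:.2f}")
# Llama-3.1-8B: x1.67
# DeepSeek-R1-Distill-Qwen-14B: x1.81
# gemma-3-12b-it: x0.99
```

The 8B and distilled 14B models still scale well when concurrency doubles, while Gemma-3-12B is effectively saturated already at 50 requests.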

Insights for A40 vLLM Performance

✅ Best for 7B-8B Models:

DeepSeek-R1-Distill-Llama-8B offers the best balance of speed & efficiency. Llama-3.1-8B is a strong alternative with good scaling.

✅ 14B Models Possible but Slower:

14B models require --max-model-len 4096 in vLLM for stability. Latency climbs noticeably at 100+ concurrent requests, so they are best suited to moderate workloads.

✅ Avoid Gemma-3-12B at 100 Concurrent Requests on the A40

While testable, the Gemma-3-12B model demonstrates elevated latency and poor efficiency, suggesting architectural incompatibility or high resource demands.

⚠️Operational Notes

  • Enable --max-model-len 4096 when running 14B models; without it, vLLM will fail to load them (see the example after this list).
  • Temperature warning: the A40 can reach 80°C+ under load. It is recommended to manually trigger fan cooling for 5 minutes during heavy workloads.
  • Gemma-3-12B results are not ideal, and the A40 only handles it comfortably at up to 50 concurrent requests. If higher concurrency is required, an A100 40GB is the recommended GPU for Gemma-3-12B inference.
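
To make the first note concrete, here is a minimal sketch, assuming vLLM is installed and the model weights fit in the A40's 48 GB; it loads a 14B model through the Python API with the reduced 4096-token context window, which is the same limit that --max-model-len 4096 applies when launching the OpenAI-compatible server. The model name and prompt are just examples.

```python
# Minimal sketch: load a 14B model on the A40 with a 4096-token context limit.
# Assumes vLLM is installed; model name and prompt are examples only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    max_model_len=4096,              # equivalent of `--max-model-len 4096` on the server CLI
    gpu_memory_utilization=0.90,     # fraction of VRAM vLLM may claim (0.9 is the default)
)

params = SamplingParams(max_tokens=600, temperature=0.7)
outputs = llm.generate(["Explain KV-cache memory usage in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

While a long run is in progress, GPU temperature can be watched from a second terminal with nvidia-smi.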

Get Started with Nvidia A40 Server Rental!

Interested in optimizing your vLLM deployment? Check out our cheap GPU server hosting services or explore alternative GPUs for high-end AI inference.

Enterprise GPU Dedicated Server - RTX 4090

$409.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: GeForce RTX 4090
  • Microarchitecture: Ada Lovelace
  • CUDA Cores: 16,384
  • Tensor Cores: 512
  • GPU Memory: 24 GB GDDR6X
  • FP32 Performance: 82.6 TFLOPS
  • Perfect for 3D rendering/modeling, CAD/professional design, video editing, gaming, HPC, AI/deep learning.

Enterprise GPU Dedicated Server - A40

$439.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A40
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 37.48 TFLOPS
  • Ideal for hosting AI image generators, deep learning, HPC, 3D rendering, VR/AR, etc.

Enterprise GPU Dedicated Server - A100

$469.00/mo (41% off recurring; was $799.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Good alternative to A800, H100, H800, L40. Supports FP64 precision computation, large-scale inference, AI training, ML, etc.

Enterprise GPU Dedicated Server - A100(80GB)

$1,559.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia A100
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 19.5 TFLOPS

Conclusion: A40 is the Cheapest Choice for 14B LLMs

The NVIDIA A40 is excellent for 7B-8B models (DeepSeek, Llama-3, Qwen-7B) even at 100+ concurrent requests. While 14B models work, they require optimizations (--max-model-len 4096) and thermal monitoring.

The A40 is a robust and cost-effective GPU for LLM inference up to 14B, comfortably supporting up to 100 concurrent requests. With minor tuning, it is well-suited for multi-user inference workloads and production-scale deployments of open-weight LLMs. For scalable hosting of LLaMA, Qwen, DeepSeek and other 7–14B models, the A40 stands out as a reliable mid-tier GPU choice.

Attachment: Video Recording of A40 vLLM Benchmark

Screenshot: A40 vLLM benchmark with 50 Concurrent Requests (one capture per tested model)
Screenshot: A40 vLLM benchmark with 100 Concurrent Requests (one capture per tested model)

Data Item Explanation in the Table:

  • Quantization: The number of quantization bits. This test uses 16-bit precision, i.e., the full, unquantized models.
  • Size(GB): Model size in GB.
  • Backend: The inference backend used. In this test, vLLM is used.
  • Request Numbers: The number of requests successfully processed.
  • Benchmark duration(s): The total time to complete all requests.
  • Total input tokens: The total number of input tokens across all requests.
  • Total generated tokens: The total number of output tokens generated across all requests.
  • Request (req/s): The number of requests processed per second.
  • Input (tokens/s): The number of input tokens processed per second.
  • Output (tokens/s): The number of output tokens generated per second.
  • Total Throughput (tokens/s): The total number of tokens processed per second (input + output).
  • Median TTFT(ms): The time from when the request is made to when the first token is received, in milliseconds. A lower TTFT means that the user is able to get a response faster.
  • P99 TTFT (ms): The 99th percentile Time to First Token, representing the worst-case latency for 99% of requests—lower is better to ensure consistent performance.
  • Median TPOT(ms): The time required to generate each output token, in milliseconds. A lower TPOT indicates that the system is able to generate a complete response faster.
  • P99 TPOT (ms): The 99th percentile Time Per Output Token, showing the worst-case delay in token generation—lower is better to minimize response variability.
  • Median Eval Rate(tokens/s): The number of tokens evaluated per second per user. A high evaluation rate indicates that the system is able to serve each user efficiently.
  • P99 Eval Rate(tokens/s): The per-user token evaluation rate at the 99th percentile, representing the worst-case user experience. A small worked example relating these metrics follows below.
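
To make the relationships between these fields concrete, here is a small worked example using the Scenario 1 column for meta-llama/Llama-3.1-8B-Instruct (all input values copied from the first table):

```python
# Values from the Scenario 1 table, meta-llama/Llama-3.1-8B-Instruct column.
requests         = 50
duration_s       = 21.81      # Benchmark Duration (s)
total_input_tok  = 5000       # Total Input Tokens (50 requests x 100 tokens)
total_output_tok = 22170      # Total Generated Tokens
median_tpot_ms   = 35.96      # Median TPOT (ms)

print(requests / duration_s)                               # ~2.29   Request (req/s)
print(total_input_tok / duration_s)                        # ~229.3  Input (tokens/s)
print(total_output_tok / duration_s)                       # ~1016.5 Output (tokens/s)
print((total_input_tok + total_output_tok) / duration_s)   # ~1245.8 Total Throughput (tokens/s)
print(1000 / median_tpot_ms)                               # ~27.8   Median Eval Rate (tokens/s)
```

Small differences from the published figures come from rounding in the table.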
Tags:

A40 Hosting, A40 vLLM benchmark, A40 LLM performance, A40 vs A6000 for LLMs, Best LLM for A40 GPU, vLLM A40 throughput, A40 14B model benchmark, A40 thermal management LLMs, DeepSeek vs Llama-3 A40, Qwen2.5 A40 speed test, Gemma-3 A40 compatibility