2× RTX 5090 Ollama Benchmark: The Best Value GPU for 70B LLM Inference

Looking for the fastest and most cost-effective way to host 70B parameter large language models (LLMs) on your own infrastructure? Meet the dual RTX 5090 setup – the latest generation of NVIDIA consumer-grade GPUs that outperform the A100, rival the H100, and come in at a fraction of the cost.

In this benchmark report, we evaluate the performance of 2× RTX 5090 GPUs running DeepSeek-R1 70B, LLaMA 3.3 70B, and Qwen 2.5 72B & 110B models using Ollama 0.6.5. If you're researching RTX 5090 LLM inference, RTX 5090 Ollama benchmarks, or a cheaper alternative to A100/H100, this is the analysis you need.

Test Overview

Server Configs:

  • Price: $999.00/month
  • CPU: Dual Gold 6148
  • RAM: 256GB
  • Storage: 240GB SSD + 2TB NVMe + 8TB SATA
  • Network: 1Gbps
  • OS: Ubuntu 22.04

Single RTX 5090 Details (per card):

  • GPU: Nvidia GeForce RTX 5090
  • Microarchitecture: Blackwell
  • Compute Capability: 12.0
  • CUDA Cores: 21,760
  • Tensor Cores: 680
  • Memory: 32 GB GDDR7
  • FP32 Performance: 104.8 TFLOPS

Framework:

  • Ollama 0.6.5

This configuration makes the 2× RTX 5090 server an ideal hosting solution for deep learning, LLM inference, and AI model training.

Ollama Benchmark Results on Nvidia 2× RTX 5090

| Models | deepseek-r1 | llama3.3 | qwen2.5 | qwen |
|---|---|---|---|---|
| Parameters | 70b | 70b | 72b | 110b |
| Size (GB) | 43 | 43 | 47 | 63 |
| Quantization | 4 | 4 | 4 | 4 |
| Running on | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 | Ollama 0.6.5 |
| Download Speed (MB/s) | 113 | 113 | 113 | 113 |
| CPU Utilization | 1.3% | 1.3% | 1.3% | 33-35% |
| RAM Utilization | 2.1% | 2.1% | 2.1% | 2.1% |
| GPU VRAM Usage (2 cards) | 70.9%, 70.4% | 71%, 75% | 77.9%, 77.6% | 94%, 91% |
| GPU Utilization (2 cards) | 45%, 48% | 47%, 45% | 45%, 48% | 20%, 20% |
| Eval Rate (tokens/s) | 27.03 | 26.85 | 24.15 | 7.22 |

Real-time resource consumption on the 2× RTX 5090 GPU server was recorded during each run.
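
The eval rate figures above are the statistics Ollama itself reports after each generation. As a rough illustration of how such numbers can be reproduced, the sketch below queries the Ollama REST API (assumed to be running on its default port 11434, with the models from the table already pulled) and derives tokens per second from the eval_count and eval_duration fields in the response; the prompt is just a placeholder.

```python
# Minimal sketch: compute Ollama's eval rate (tokens/s) from its REST API response.
# Assumes an Ollama server at localhost:11434 and the models already pulled.
import requests

def eval_rate(model: str, prompt: str) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds.
    return data["eval_count"] / data["eval_duration"] * 1e9

for model in ["deepseek-r1:70b", "llama3.3:70b", "qwen2.5:72b", "qwen:110b"]:
    rate = eval_rate(model, "Explain quantization in one paragraph.")
    print(f"{model}: {rate:.2f} tokens/s")
```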

Analysis & Insights

1. Best Performance per Dollar

The dual RTX 5090 configuration delivers eval rates of up to 27 tokens/s for 70B models, matching or exceeding H100 speeds while costing significantly less.

2. More Affordable Than H100

The RTX 5090 is a consumer GPU priced at roughly 35-45% of an H100, yet it delivers comparable results when hosting quantized 70B models with Ollama.

3. 64GB VRAM Limitation

  • The 70B and 72B models run comfortably, with the weights fully resident in GPU memory and both cards actively utilized. The 110B Qwen model struggled: its 63GB of weights plus KV cache and runtime overhead exceed 64GB of VRAM, so part of the model spills to the CPU (note the 33-35% CPU utilization). GPU utilization dropped to about 20% per card and the eval rate fell to just 7.22 tokens/s. A rough fit estimate is sketched after this list.
  • This highlights that 64GB of VRAM is not enough for smooth inference of 110B+ LLMs, even with 4-bit quantization.
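
As a rough back-of-the-envelope check (an illustration, not part of the measured data), the snippet below estimates whether each model's Q4 weights fit in the server's 64GB of aggregate VRAM; the ~15% allowance for KV cache and runtime buffers is an assumption, not a figure from this benchmark.

```python
# Rough sketch: does a Q4-quantized model fit in aggregate VRAM?
# The 15% overhead allowance (KV cache, CUDA context, buffers) is an assumption.
def fits_in_vram(model_size_gb: float, vram_per_gpu_gb: float = 32,
                 num_gpus: int = 2, overhead: float = 0.15) -> bool:
    total_vram = vram_per_gpu_gb * num_gpus
    required = model_size_gb * (1 + overhead)
    print(f"{model_size_gb} GB model -> needs ~{required:.0f} GB, have {total_vram} GB")
    return required <= total_vram

fits_in_vram(43)  # deepseek-r1 / llama3.3 70B Q4: fits
fits_in_vram(47)  # qwen2.5 72B Q4: fits
fits_in_vram(63)  # qwen 110B Q4: does not fit -> partial CPU offload
```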

2× RTX 5090 vs. 2× A100 vs. H100 for 70B LLMs on Ollama

When comparing the performance of the LLaMA 3.3 70B model on Ollama across three high-end GPU configurations, the results may surprise you:

| Metric | Nvidia 2× RTX 5090 | Nvidia H100 | Nvidia 2× A100 40GB |
|---|---|---|---|
| Model | llama3.3:70b | llama3.3:70b | llama3.3:70b |
| Eval Rate (tokens/s) | 26.85 | 24.34 | 18.91 |

The dual RTX 5090 setup outperforms both H100 and 2× A100 40GB in terms of raw Eval Rate — delivering the highest tokens-per-second output for this 70B model in Ollama. This positions the RTX 5090 not just as a cost-effective choice, but as the performance leader in this category — ideal for developers and businesses running high-parameter LLMs without access to expensive enterprise-grade GPUs.
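
To put "cost-effective" in numbers, the short calculation below combines the llama3.3:70b eval rates from the table above with the monthly prices of the hosting plans listed later in this article; it is an illustration of the comparison, not additional measured data.

```python
# Tokens/s per $100 of monthly server cost, using the llama3.3:70b eval rates
# above and the monthly plan prices quoted later in this article.
servers = {
    "2x RTX 5090":  {"eval_rate": 26.85, "usd_per_month": 999.00},
    "H100":         {"eval_rate": 24.34, "usd_per_month": 2099.00},
    "2x A100 40GB": {"eval_rate": 18.91, "usd_per_month": 1099.00},
}

for name, s in servers.items():
    per_100_usd = s["eval_rate"] / s["usd_per_month"] * 100
    print(f"{name}: {per_100_usd:.2f} tokens/s per $100/month")
# ~2.69 (2x RTX 5090), ~1.16 (H100), ~1.72 (2x A100 40GB)
```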

2× RTX 5090 GPU Hosting for LLMs

Our dedicated 2× RTX 5090 GPU server is optimized for LLM inference, fine-tuning, and deep learning workloads. With 64GB of combined VRAM, it comfortably serves Ollama models up to roughly 72B parameters; larger models such as 110B require CPU offloading and run considerably slower.
Flash Sale to April 30

Enterprise GPU Dedicated Server - RTX A6000

$329.00/mo
40% OFF Recurring (Was $549.00)
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia Quadro RTX A6000
  • Microarchitecture: Ampere
  • CUDA Cores: 10,752
  • Tensor Cores: 336
  • GPU Memory: 48GB GDDR6
  • FP32 Performance: 38.71 TFLOPS
  • Optimally running AI, deep learning, data visualization, HPC, etc.
New Arrival

Multi-GPU Dedicated Server - 2xRTX 5090

$999.00/mo
  • 256GB RAM
  • Dual Gold 6148
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x GeForce RTX 5090
  • Microarchitecture: Blackwell
  • CUDA Cores: 21,760
  • Tensor Cores: 680
  • GPU Memory: 32 GB GDDR7
  • FP32 Performance: 104.8 TFLOPS

Multi-GPU Dedicated Server - 2xA100

$1,099.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 1Gbps
  • OS: Windows / Linux
  • GPU: 2 x Nvidia A100 40GB
  • Microarchitecture: Ampere
  • CUDA Cores: 6912
  • Tensor Cores: 432
  • GPU Memory: 40GB HBM2
  • FP32 Performance: 19.5 TFLOPS
  • Free NVLink Included
  • A powerful dual-GPU solution for demanding AI workloads, large-scale inference, and ML training. A cost-effective alternative to the A100 80GB and H100, delivering exceptional performance at a competitive price.

Enterprise GPU Dedicated Server - H100

$2,099.00/mo
  • 256GB RAM
  • Dual 18-Core E5-2697v4
  • 240GB SSD + 2TB NVMe + 8TB SATA
  • 100Mbps-1Gbps
  • OS: Windows / Linux
  • GPU: Nvidia H100
  • Microarchitecture: Hopper
  • CUDA Cores: 14,592
  • Tensor Cores: 456
  • GPU Memory: 80GB HBM2e
  • FP32 Performance: 183 TFLOPS

Conclusion: 2× RTX 5090 is the Ideal Choice for 70B LLMs

Whether you're looking for the best GPU for LLaMA 3.3 70B, the cheapest setup to run DeepSeek-R1 70B, or Ollama RTX 5090 hosting benchmarks, the verdict is clear: 👉 2× RTX 5090 is the new sweet spot for hosting LLMs up to 72B parameters.

Tags:

Nvidia RTX 5090 Hosting, rtx 5090 ollama, dual rtx 5090 benchmark, rtx 5090 vs h100 inference, best gpu for 70b llm, 2x rtx 5090 llm inference, deepseek 70b benchmark, llama3 70b ollama, huggingface 70b gpu, ollama 5090 results, cheap gpu for large language models, 110b llm hardware requirements