Best LLM API Models for RAG

RAG workloads often need enough context to hold retrieved passages, plus economical input pricing. This page ranks models by context window size and lists normalized token prices and a sample workload cost for each.

Models listed: 50
Cost example tokens: 1M input + 500K output
Normalized prices: USD per 1M tokens

Quick shortlist

Start with Llama 4 Scout.

This guide is sorted by context window, so the first rows have the largest windows and are a natural starting point for RAG, long documents, and large codebase context.

Lead model: Llama 4 Scout
Provider: Meta
Sample cost: $0.25
Context: 10M

The ranking is a discovery aid, not a final recommendation. Always compare the model against your workload and verify provider pricing before production use.

How to read this ranking

Models are sorted by context window size. Use this page when your workflow needs long documents, large retrieval payloads, or multi-file context.

Model Ranking

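| Model | Provider | Prompt (USD / 1M) | Output (USD / 1M) | Sample cost | Your Cost | Context | Popularity | Release |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 4 Scout | Meta | $0.1 | $0.3 | $0.25 | $0.25 | 10M | | |
| Grok 4.20 | xAI | $1.25 | $2.5 | $2.5 | $2.5 | 2M | #88 | |
| Grok 4.20 Multi-Agent | xAI | $1.25 | $2.5 | $2.5 | $2.5 | 2M | #155 | |
| 🔥 GPT-5.5 | OpenAI | $5 | $30 | $20 | $20 | 1.05M | #19 | |
| GPT-5.4 | OpenAI | $2.5 | $15 | $10 | $10 | 1.05M | #30 | |
| GPT-5.5 Pro | OpenAI | $30 | $180 | $120 | $120 | 1.05M | #161 | |
| GPT-5.4 Pro | OpenAI | $30 | $180 | $120 | $120 | 1.05M | #251 | |
| OpenAI GPT Latest | OpenAI | $5 | $30 | $20 | $20 | 1.05M | | |
| 🔥 Owl Alpha | OpenRouter | $0 | $0 | $0 | $0 | 1.05M | #7 | |
| Gemini 3.1 Pro Preview Custom Tools | Google | $2 | $12 | $8 | $8 | 1.05M | #120 | |
| 🔥 DeepSeek V4 Flash | DeepSeek | $0.09 | $0.18 | $0.18 | $0.18 | 1.05M | #1 | |
| 🔥 MiMo-V2.5 | Xiaomi | $0.105 | $0.28 | $0.24 | $0.24 | 1.05M | #3 | |
| New 🔥 MiniMax M3 | MiniMax | $0.3 | $1.2 | $0.9 | $0.9 | 1.05M | #4 | |
| 🔥 DeepSeek V4 Pro | DeepSeek | $0.435 | $0.87 | $0.87 | $0.87 | 1.05M | #5 | |
| 🔥 Gemini 3 Flash Preview | Google | $0.5 | $3 | $2 | $2 | 1.05M | #10 | |
| 🔥 Gemini 2.5 Flash | Google | $0.3 | $2.5 | $1.55 | $1.55 | 1.05M | #13 | |
| 🔥 Gemini 2.5 Flash Lite | Google | $0.1 | $0.4 | $0.3 | $0.3 | 1.05M | #15 | |
| 🔥 Gemini 3.5 Flash | Google | $1.5 | $9 | $6 | $6 | 1.05M | #16 | |
| 🔥 MiMo-V2.5-Pro | Xiaomi | $0.435 | $0.87 | $0.87 | $0.87 | 1.05M | #17 | |
| 🔥 Gemini 3.1 Flash Lite | Google | $0.25 | $1.5 | $1 | $1 | 1.05M | #20 | |
| Gemini 3.1 Pro Preview | Google | $2 | $12 | $8 | $8 | 1.05M | #31 | |
| Gemini 3.1 Flash Lite Preview | Google | $0.25 | $1.5 | $1 | $1 | 1.05M | #34 | |
| Gemini 2.5 Pro | Google | $1.25 | $10 | $6.25 | $6.25 | 1.05M | #57 | |
| Qwen3 Coder 480B A35B | Qwen | $0.22 | $1.8 | $1.12 | $1.12 | 1.05M | #59 | |
| Gemini 2.5 Flash Lite Preview 09-2025 | Google | $0.1 | $0.4 | $0.3 | $0.3 | 1.05M | #81 | |
| Lyria 3 Pro Preview | Google | $0 | $0 | $0 | $0 | 1.05M | #283 | |
| Lyria 3 Clip Preview | Google | $0 | $0 | $0 | $0 | 1.05M | #291 | |
| New GLM 5.2 | Z.ai | $0.95 | $3 | $2.45 | $2.45 | 1.05M | | |
| Google Gemini Pro Latest | Google | $2 | $12 | $8 | $8 | 1.05M | | |
| Google Gemini Flash Latest | Google | $1.5 | $9 | $6 | $6 | 1.05M | | |
| DeepSeek V4 Flash (free) | DeepSeek | $0 | $0 | $0 | $0 | 1.05M | | |
| MiMo-V2-Pro | Xiaomi | $1 | $3 | $2.5 | $2.5 | 1.05M | | |
| Qwen3 Coder 480B A35B (free) | Qwen | $0 | $0 | $0 | $0 | 1.05M | | |
| Gemini 2.5 Pro Preview 06-05 | Google | $1.25 | $10 | $6.25 | $6.25 | 1.05M | | |
| Gemini 2.5 Pro Preview 05-06 | Google | $1.25 | $10 | $6.25 | $6.25 | 1.05M | | |
| Llama 4 Maverick | Meta | $0.15 | $0.6 | $0.45 | $0.45 | 1.05M | | |
| Gemini 2.0 Flash Lite | Google | $0.075 | $0.3 | $0.22 | $0.22 | 1.05M | | |
| Gemini 2.0 Flash | Google | $0.1 | $0.4 | $0.3 | $0.3 | 1.05M | | |
| GPT-4.1 Mini | OpenAI | $0.4 | $1.6 | $1.2 | $1.2 | 1.05M | #52 | |
| GPT-4.1 Nano | OpenAI | $0.1 | $0.4 | $0.3 | $0.3 | 1.05M | #58 | |
| GPT-4.1 | OpenAI | $2 | $8 | $6 | $6 | 1.05M | #66 | |
| Palmyra X5 | Writer | $0.6 | $6 | $3.6 | $3.6 | 1.04M | #264 | |
| MiniMax-01 | MiniMax | $0.2 | $1.1 | $0.75 | $0.75 | 1M | #215 | |
| 🔥 Claude Sonnet 4.6 | Anthropic | $3 | $15 | $10.5 | $10.5 | 1M | #6 | |
| 🔥 Claude Opus 4.7 | Anthropic | $5 | $25 | $17.5 | $17.5 | 1M | #8 | |
| New 🔥 Nemotron 3 Ultra (free) | NVIDIA | $0 | $0 | $0 | $0 | 1M | #12 | |
| 🔥 Claude Opus 4.6 | Anthropic | $5 | $25 | $17.5 | $17.5 | 1M | #18 | |
| Nemotron 3 Super (free) | NVIDIA | $0 | $0 | $0 | $0 | 1M | #23 | |
| Qwen3.7 Max | Qwen | $1.25 | $3.75 | $3.12 | $3.12 | 1M | #35 | |
| New Qwen3.7 Plus | Qwen | $0.32 | $1.28 | $0.96 | $0.96 | 1M | #51 | |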
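
One way to turn the ranking into a shortlist is to keep only the models whose context window covers your prompt size, then order the survivors by prompt price. The sketch below is illustrative only: the `ModelRow` type and `shortlist` function are hypothetical names, the four rows are copied from the ranking, and the token counts assume that "1.05M" means 1,050,000 tokens, which the page does not confirm.

```python
from dataclasses import dataclass


@dataclass
class ModelRow:
    name: str
    prompt_price: float   # USD per 1M input tokens, as listed in the ranking
    context_tokens: int   # assumed expansion of the rounded "Context" value


# A few rows copied from the ranking (context sizes are assumed expansions).
ROWS = [
    ModelRow("Llama 4 Scout", 0.1, 10_000_000),
    ModelRow("Grok 4.20", 1.25, 2_000_000),
    ModelRow("DeepSeek V4 Flash", 0.09, 1_050_000),
    ModelRow("Claude Sonnet 4.6", 3.0, 1_000_000),
]


def shortlist(rows, required_context):
    """Keep models whose window fits the prompt, cheapest prompt price first."""
    fits = [r for r in rows if r.context_tokens >= required_context]
    return sorted(fits, key=lambda r: r.prompt_price)


# Example: retrieved passages plus question need roughly 1.5M tokens.
for row in shortlist(ROWS, 1_500_000):
    print(row.name, row.prompt_price)
```

A filter like this only narrows the list. It says nothing about answer quality on long inputs, so the surviving models still need to be tested on your own documents.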

Pricing FAQ

How is the sample workload cost calculated?

The sample workload uses 1,000,000 input tokens plus 500,000 output tokens, then applies each model's normalized USD price per 1 million tokens.
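
The calculation described above can be sketched in a few lines. The function name `sample_cost` is hypothetical and is not part of any published API. It simply multiplies each token volume, expressed in millions, by the matching per-1M price.

```python
def sample_cost(prompt_price, output_price,
                input_tokens=1_000_000, output_tokens=500_000):
    """Estimated workload cost in USD.

    prompt_price and output_price are normalized prices in USD per 1M tokens.
    The default token volumes match the sample workload described on this page.
    """
    per_million = 1_000_000
    input_cost = (input_tokens / per_million) * prompt_price
    output_cost = (output_tokens / per_million) * output_price
    return input_cost + output_cost


# Llama 4 Scout row: $0.1 prompt, $0.3 output
print(round(sample_cost(0.1, 0.3), 2))  # 0.25

# GPT-5.5 row: $5 prompt, $30 output
print(sample_cost(5, 30))  # 20.0
```

Both results match the "Sample cost" column for those rows. Passing your own token volumes gives an estimate for a different workload size.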

Why do input and output token prices matter separately?

Many applications are output-token heavy, while retrieval and classification workloads may be input-token heavy. Comparing both prices helps avoid picking a model that is cheap for the wrong workload shape.
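
To illustrate how workload shape can change which model is cheaper, the sketch below compares two rows from the ranking under two made-up workloads. The token volumes are assumptions chosen for illustration, and `workload_cost` and `cheapest` are hypothetical helper names.

```python
def workload_cost(prompt_price, output_price, input_tokens, output_tokens):
    """Cost in USD, given prices in USD per 1M tokens."""
    return (input_tokens * prompt_price + output_tokens * output_price) / 1_000_000


# (prompt price, output price) copied from the ranking table.
MODELS = {
    "Qwen3 Coder 480B A35B": (0.22, 1.8),
    "MiniMax M3": (0.3, 1.2),
}

# Illustrative token volumes (input, output); these are assumptions.
WORKLOADS = {
    "input-heavy": (10_000_000, 500_000),    # e.g. retrieval with short answers
    "output-heavy": (1_000_000, 5_000_000),  # e.g. long generated responses
}


def cheapest(workload_name):
    """Return the name of the lower-cost model for the given workload."""
    in_tok, out_tok = WORKLOADS[workload_name]
    return min(MODELS, key=lambda m: workload_cost(*MODELS[m], in_tok, out_tok))


for name in WORKLOADS:
    print(name, "->", cheapest(name))
```

With these volumes, the model with the lower prompt price is cheaper for the input-heavy workload and the model with the lower output price is cheaper for the output-heavy one. That is why both prices should be compared against your actual workload shape.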

Should I verify prices before production use?

Yes. AI Model Matrix normalizes public pricing metadata for comparison, but provider availability, limits, and prices can change. Always verify the final contract or provider dashboard before production use.

Related Guides

Cheapest LLM APIs

Sort models by estimated workload cost and normalized token prices.

Largest Context Windows

Find models for long documents, retrieval, and codebase context.

Coding Models

Compare code-oriented models by cost, context, and practical popularity signals.

Free Models

Browse zero-price models for prototypes and evaluation.

RAG Models

Start from large context windows and practical input-cost constraints.

Chatbot Costs

Find budget-sensitive models for output-heavy assistant traffic.

Cost Calculator

Enter your own input and output token volume before narrowing the shortlist.