Llama Word Limit by Model (2026)
Meta's Llama 4 Scout holds the largest context window of any generally available model: 10 million tokens. Here's what that actually means, how Maverick compares, and why bigger is not always better.
Quick Answer
Llama 4 Scout accepts 10,000,000 tokens (~7.5M words), the largest context window of any production LLM. Llama 4 Maverick accepts 1,000,000 tokens (~750K words). Llama 3.3 70B accepts 128,000 tokens. All Llama models are open-weight: you can self-host, fine-tune, and run without per-token fees. The tradeoff: Llama 4 Scout requires datacenter GPUs and the 10M context is best for retrieval, not synthesis.
Llama context windows by model
| Model | Input tokens | Input words | Parameters (active / total) |
|---|---|---|---|
| Llama 4 Scout | 10,000,000 | ~7,500,000 | 17B / 109B (MoE) |
| Llama 4 Maverick | 1,000,000 | ~750,000 | 17B / 400B (MoE) |
| Llama 4 Behemoth (training) | Not yet released | — | 288B / 2T (MoE) |
| Llama 3.3 70B | 128,000 | ~96,000 | 70B dense |
| Llama 3.1 405B | 128,000 | ~96,000 | 405B dense |
| Llama 3.1 8B | 128,000 | ~96,000 | 8B dense |
Specs from Meta's Llama 4 announcement (April 2025) and llama.com model cards as of April 2026.
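The word estimates in the table come from the common rule of thumb that one token is roughly 0.75 English words (the ratio varies by language and content type). A minimal sketch of the conversion, with function names of our own choosing:

```python
# Rough English-text heuristic: ~0.75 words per token (~1.33 tokens per word).
# Code and non-English text usually tokenize denser than English prose.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: int) -> int:
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    return round(words / WORDS_PER_TOKEN)

print(tokens_to_words(10_000_000))  # 7500000 -- Scout's full window in words
print(words_to_tokens(96_000))      # 128000  -- Llama 3's window in tokens
```

For anything budget-sensitive, count real tokens with the model's actual tokenizer rather than this heuristic.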
How Scout reaches 10 million tokens
Meta did something clever and controversial with Llama 4 Scout: it pre-trained and post-trained the model on 256K-token sequences, then used length-generalization techniques to extrapolate up to 10M tokens at inference time. This is not the same as training directly on 10M-token sequences. It works, but with caveats.
Independent evaluations confirm Scout handles retrieval-oriented tasks (finding specific facts buried in long context) reliably well past 1M tokens. Synthesis tasks (reasoning across the entire context) degrade noticeably past 1-2M tokens. In practice: treat Scout as a 1-2M effective context model for complex reasoning, and a 5-10M retrieval engine for "find this specific thing in the haystack" use cases.
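That retrieval-vs-synthesis distinction can be encoded as a simple pre-flight check. The thresholds below are the rules of thumb from the paragraph above, and the function name is illustrative, not any official API:

```python
# Rules of thumb from the text: Scout retrieves reliably out to ~10M tokens,
# but reasoning across the whole context degrades past ~1-2M.
SCOUT_RETRIEVAL_LIMIT = 10_000_000
SCOUT_SYNTHESIS_LIMIT = 2_000_000

def fits_scout(context_tokens: int, task: str) -> bool:
    """task: 'retrieval' (needle-in-haystack lookup) or 'synthesis'
    (reasoning across the entire context)."""
    limit = SCOUT_RETRIEVAL_LIMIT if task == "retrieval" else SCOUT_SYNTHESIS_LIMIT
    return context_tokens <= limit

print(fits_scout(5_000_000, "retrieval"))  # True
print(fits_scout(5_000_000, "synthesis"))  # False -- chunk or summarize first
```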
Even with those caveats, 10M tokens enables things no other model can attempt. You can load a mid-sized company's entire codebase. You can load every email a user has sent in the last five years. You can load the complete archive of a medium-sized academic journal. The retrieval-heavy subset of these tasks works now, today, on Scout.
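A crude way to check whether a codebase actually fits is the common ~4 characters-per-token approximation. This sketch (the extension list and helper name are illustrative) sums file sizes and divides:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough approximation for mixed code and prose

def estimate_repo_tokens(root: str, exts=(".py", ".md", ".txt")) -> int:
    """Very rough token estimate for all matching files under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

# tokens = estimate_repo_tokens("path/to/repo")
# print(tokens <= 10_000_000)  # does it fit in Scout's window?
```

A real deployment should tokenize with the model's tokenizer; this is only for a first go/no-go estimate.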
Scout vs Maverick — which one you want
Both models activate 17 billion parameters per token using mixture-of-experts architecture. The practical difference:
- Scout (17B active / 109B total, 16 experts): Smaller total size, fits in a single H100 with INT4 quantization. 10M context. Best for retrieval-heavy tasks and developers who need open weights on manageable hardware.
- Maverick (17B active / 400B total, 128 experts): Bigger total model with 8x more experts for better reasoning quality. 1M context. Requires 8x H100 minimum for FP8 serving. Best when quality matters more than context ceiling.
For most teams starting with Llama 4, Scout is the practical choice. The hardware requirements are actually achievable, the 10M context is a genuine differentiator, and quality on general tasks is strong. Maverick is the choice when quality on shorter prompts matters more than the long-context ceiling.
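The decision rule above reduces to two questions: does the prompt exceed Maverick's 1M window, and does quality matter more than context? A simplified sketch (our own framing, not Meta guidance):

```python
def pick_llama4(context_tokens: int, quality_first: bool) -> str:
    """Simplified Scout-vs-Maverick decision rule from the comparison above."""
    if context_tokens > 1_000_000:
        return "scout"  # only Scout's 10M window can hold it
    return "maverick" if quality_first else "scout"

print(pick_llama4(3_000_000, quality_first=True))  # scout -- forced by context
print(pick_llama4(200_000, quality_first=True))    # maverick -- quality wins
```

Real selection should also weigh hardware budget: Maverick's 8x H100 floor rules it out for many teams before quality even enters the picture.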
Cost — the free-ish thing nobody else offers
Llama is open-weight. You can download the model and run it on your own hardware with no per-token API fees. The catch is infrastructure cost.
If you don't want to self-host, Llama is available through hosted providers at prices well below proprietary equivalents:
- Llama 4 Scout (Groq): ~$0.11 / M input, ~$0.34 / M output
- Llama 4 Scout (OpenRouter): ~$0.08 / M input, ~$0.30 / M output, 327K context available
- Llama 4 Maverick (Groq): ~$0.50 / M input, ~$0.77 / M output
- Self-hosted Scout (single H100): ~$1,800-2,900 / month for one GPU, regardless of token volume
- Self-hosted Maverick (8x H100): ~$17,500-23,000 / month
For steady high volume, self-hosting breaks even with hosted APIs at roughly 10-20 billion tokens/month: a single H100 at ~$1,800-2,900/month versus ~$0.16 per million tokens blended at Groq's Scout rates. Below that, hosted services are cheaper. Above that, self-hosting is cheaper and you get data isolation.
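The breakeven point follows directly from the prices listed above. A sketch of the arithmetic, assuming an 80/20 input/output token mix (an assumption you should adjust to your workload) and the midpoint of the H100 cost range:

```python
# Groq Scout list prices from the table above, converted to $/token.
INPUT_PRICE = 0.11 / 1e6
OUTPUT_PRICE = 0.34 / 1e6
H100_MONTHLY = 2_350  # midpoint of the $1,800-2,900/month range

def hosted_cost(tokens: float, input_share: float = 0.8) -> float:
    """Monthly hosted-API cost for a given token volume and input/output mix."""
    blended = input_share * INPUT_PRICE + (1 - input_share) * OUTPUT_PRICE
    return tokens * blended

# Breakeven: the volume at which hosted cost equals one H100's monthly rent.
breakeven = H100_MONTHLY / (0.8 * INPUT_PRICE + 0.2 * OUTPUT_PRICE)
print(f"{breakeven / 1e9:.1f}B tokens/month")  # 15.1B tokens/month
```

Note that self-hosting also carries ops overhead (monitoring, upgrades, redundancy) that this per-GPU number ignores, so the practical crossover sits somewhat higher.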
The EU licensing problem
Worth knowing before you commit. The Llama 4 Community License explicitly excludes companies whose headquarters or principal place of business is in the EU or UK from using the models for commercial purposes. Meta has attributed the restriction to regulatory uncertainty around the EU's AI rules (the AI Act and GDPR) rather than to any technical limitation.
Workarounds exist. EU teams can use hosted services where the provider is non-EU (Groq, OpenRouter, cloud deployments in non-EU regions). Some managed AI services offer alternative open-weight models with equivalent capabilities. If you're building a product that will have EU users or EU developers, read the license carefully before shipping.
FAQ
What is Llama's word limit?
Llama 4 Scout accepts about 7.5 million words (10M tokens), the largest of any production model. Maverick accepts about 750,000 words (1M tokens). Older Llama 3 models cap at 128,000 tokens.
Does Scout actually use all 10M tokens?
For retrieval tasks, yes. For complex reasoning across the full context, effective capacity is more like 1-2M. Scout was trained at 256K and extrapolates to 10M, so retrieval-heavy workloads hit the sweet spot.
Is Llama 4 free?
The model weights are open under the Llama 4 Community License. You can self-host without paying per token, but you need significant GPU hardware. Hosted services offer Llama at very low per-token rates (10-50x cheaper than equivalent proprietary models).
Can I run Llama 4 locally?
Scout INT4 needs ~55GB of VRAM, so a single datacenter GPU (H100 80GB) works. Consumer GPUs top out around 24-32GB and cannot fit it. For laptops and desktop use, look at Llama 3.1 8B or Qwen 3 instead.
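The ~55GB figure is a back-of-envelope weight-memory calculation: total parameter count times bytes per parameter. A sketch (KV cache and activations add real overhead on top of this, so treat it as a floor):

```python
def weight_vram_gb(total_params: float, bits_per_param: int) -> float:
    """GB needed for model weights alone at a given quantization width."""
    return total_params * bits_per_param / 8 / 1e9

print(round(weight_vram_gb(109e9, 4), 1))  # 54.5 -- Scout INT4, fits one H100
print(round(weight_vram_gb(400e9, 8), 1))  # 400.0 -- Maverick FP8, hence 8x H100
```

Note that with MoE models all experts must be resident in memory even though only 17B parameters activate per token, which is why the total count, not the active count, drives the VRAM requirement.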
Can EU companies use Llama 4?
Not directly under the Community License. You can use hosted services with non-EU providers. Read the license carefully if building commercial products with EU users or developers.