The AI model refuses your request. You try rephrasing it — still refuses. You try a roleplay framing — still refuses. Then you try something different: you include 256 examples of the model apparently answering similar requests, stacked up in the prompt before your actual question. Now the bypass rate is over 60%.

That’s many-shot jailbreaking — and it exploits one of the features that makes modern AI models genuinely useful: in-context learning. The same capability that allows an LLM to understand task patterns from a few examples in the prompt can be weaponised to override safety training by flooding the context with fabricated examples of unsafe compliance. The bigger the context window, the more examples you can include, the higher the bypass rate. And context windows have grown from 8,000 tokens to 200,000+ tokens in two years.
🎯 After This Article
How many-shot jailbreaking works — the in-context learning mechanism it exploits
The Anthropic research — demonstrated bypass rate scaling with shot count
Why expanding context windows increase the many-shot attack surface
The In-Context Learning Mechanism — What Many-Shot Exploits
My testing methodology for many-shot vulnerability gives you a reproducible result within one session. My defence recommendations focus on what actually reduces exploitation rate in practice — not what sounds theoretically satisfying. Context window expansion is the development trend I track most closely from an attack surface perspective. The Anthropic research paper on many-shot jailbreaking is the one I cite when I need to move a sceptical AI team from ‘this isn’t real’ to ‘let’s fix this’. Understanding in-context learning is the prerequisite I always cover first — it explains why many-shot works mechanically.

Large language models are trained to predict the next token given prior context. A core emergent property of large-scale training is in-context learning: the ability to infer task patterns from examples provided in the context window and continue those patterns in new outputs. This capability is central to how LLMs are made useful — provide a few examples of the desired format or behaviour, and the model generalises the pattern to new inputs.
Many-shot jailbreaking weaponises this capability by providing examples of a different pattern than the safety training intends. The attacker constructs a long prompt containing many fabricated Q&A exchanges where the “model” complies with harmful requests. The real model, seeing this in-context distribution of unsafe compliance, is conditioned to continue the pattern when the actual harmful request arrives at the end. The model is doing what it’s trained to do — following in-context patterns — but the in-context signal overwhelms the safety training signal when enough examples are provided.
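To make the mechanism concrete, here is a minimal sketch of how a many-shot prompt is assembled. The helper name and the placeholder pairs are illustrative assumptions, not part of any published tooling — in a real red-team engagement the fabricated pairs would mimic the target policy violation, but structurally all that matters is that many completed exchanges precede the real request.

```python
# Minimal sketch: assembling a many-shot prompt from fabricated Q&A pairs.
# The pairs below are benign placeholders -- the structure, not the content,
# is what conditions the model to continue the in-context pattern.

def build_many_shot_prompt(pairs: list[tuple[str, str]], final_question: str) -> str:
    """Concatenate fabricated exchanges, then append the real request last."""
    shots = [f"User: {q}\nAssistant: {a}" for q, a in pairs]
    return "\n\n".join(shots) + f"\n\nUser: {final_question}\nAssistant:"

# 256 placeholder shots purely to illustrate scale; a real attack varies them.
placeholder_pairs = [("How do I do X?", "Sure — here is how to do X...")] * 256
prompt = build_many_shot_prompt(placeholder_pairs, "How do I do Y?")
print(f"{len(placeholder_pairs)} shots, {len(prompt):,} characters")
```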
📸 Many-shot bypass rate scaling with shot count (illustrative trend based on published research direction). The key finding from Anthropic’s research: bypass rate is not flat as shot count increases — it rises substantially as more fabricated examples are added to the context. This scaling relationship means context window expansion is a direct security concern: a model that is robust to 16-shot attacks may not be robust to 256-shot attacks in a larger context window. The defence implication: safety testing should be conducted at the maximum shot count your deployment allows, not at short contexts.
The Anthropic Many-Shot Research
Anthropic disclosed the many-shot jailbreaking technique in April 2024, following their own research demonstrating the bypass mechanism across multiple LLMs including Claude. The disclosure was notable for two reasons: it was a major AI company publishing safety vulnerability research about their own model before the technique was independently discovered and widely exploited, and it included responsible disclosure to other AI providers to allow them to implement defences before publication.
The research demonstrated that bypass rates increased predictably with shot count across multiple harm categories and multiple models. The technique was effective not just against Claude but against other frontier LLMs tested in the research. This cross-model effectiveness established many-shot jailbreaking as a general property of LLM in-context learning rather than a model-specific weakness.
🛠️ EXERCISE 1 — BROWSER (15 MIN · NO INSTALL)
Read Anthropic’s Many-Shot Jailbreaking Research and Understand the Defences
⏱️ 15 minutes · Browser only
Anthropic’s own research paper on the vulnerability they found in their model is a rare primary source — it’s more detailed and precise than any secondary coverage, and it describes the defences they implemented directly.
Step 1: Find the Anthropic many-shot jailbreaking paper
Search: “many-shot jailbreaking Anthropic 2024 paper”
Go to the Anthropic research blog or arXiv.
Read the abstract and introduction.
What was the maximum bypass rate demonstrated in their research?
Step 2: Understand the defence they recommend
Read the defences section of the paper.
What is their recommended primary defence?
Why do they recommend output classifiers over input filtering?
What are the limitations of each approach?
Step 3: Check current model defences
Search: “Claude many-shot jailbreaking defence update 2024 2025”
Has Anthropic published updates on their many-shot defences since the paper?
Do current Claude models appear more robust to many-shot than at disclosure?
Step 4: Find other AI provider responses
Search: “OpenAI GPT-4 many-shot jailbreaking defence 2024”
Did other providers respond to the technique with updates?
Are there published evaluations of current model robustness to many-shot?
Step 5: Test context window limits on a public AI
Pick a publicly accessible AI assistant.
What’s the maximum input length it accepts?
At maximum input length, how many Q&A pairs of ~50 words each could you include?
Does this model’s context window expose a meaningful many-shot attack surface?
✅ The output classifier recommendation (Step 2) is the core defensive insight from Anthropic’s research — and it’s counterintuitive relative to most injection defences, which focus on input filtering. The argument is that input filtering must anticipate all possible many-shot formulations (difficult, since they’re highly variable in content and structure), while output classification catches the unsafe content regardless of how it was generated. A response containing harmful content is detectable by its own properties, not by the properties of the prompt that produced it. The context window calculation (Step 5) makes the exposure concrete: a 128,000-token context filled with 50-word Q&A pairs (~75 tokens each) can fit well over 1,000 shots — well into the high-bypass-rate region from Anthropic’s research. Any deployment accepting 100K+ token inputs is exposed to high-shot-count attacks.
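If you want to reproduce the Step 5 arithmetic for any context window, a quick sketch follows — the 1.5 tokens-per-word ratio is an assumption typical of English text under common tokenizers, so treat the results as estimates rather than exact counts.

```python
# Back-of-the-envelope maximum shot count for a given user-controlled context.
# Assumption: ~1.5 tokens per English word, so a 50-word Q&A pair is ~75 tokens.

def max_shots(user_tokens: int, words_per_pair: int = 50,
              tokens_per_word: float = 1.5) -> int:
    return int(user_tokens // (words_per_pair * tokens_per_word))

for window in (8_000, 128_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens -> ~{max_shots(window):,} shots")
# 8,000 -> ~106; 128,000 -> ~1,706; 200,000 -> ~2,666; 1,000,000 -> ~13,333
```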
📸 Share the Anthropic paper’s recommended defence and your context window calculation in #ai-security.
Context Window Expansion and Attack Surface Growth
The trajectory of context window sizes in frontier AI models is directly relevant to many-shot jailbreaking risk. GPT-3 shipped with a 2,048-token context. GPT-4 Turbo expanded to 128,000 tokens. Claude 3 supports 200,000 tokens. Gemini 1.5 Pro demonstrated a 1,000,000-token context. Each expansion enables proportionally more shots in a many-shot attack.
The security community’s framing of context window expansion has largely focused on capability — what new use cases become possible with longer context. The many-shot research adds a different framing: longer context windows are a larger attack surface for in-context learning exploitation. AI deployments that accept full long-context inputs from users should be tested at maximum context length, not just at typical usage lengths, because the attack effectiveness is context-length-dependent.
Defences — What Actually Works
Anthropic’s research identifies three effective defence categories. Output safety classifiers are the most robust: they evaluate the response independently of how it was generated, catching unsafe outputs whether produced by a direct request or a 256-shot attack. Input length limits are the most direct architectural control: capping user-controlled context length directly caps the maximum shot count. And safety training augmentation — including many-shot adversarial examples in training data — improves the model’s in-context robustness by exposing it to the attack format during training.
Input filtering for many-shot patterns is less effective because the attack is highly variable in surface form — the harmful content in the fabricated Q&A pairs can cover any topic, and distinguishing many-shot attack prompts from legitimate long-context use cases (document analysis, multi-document synthesis) is technically difficult. The output classifier approach sidesteps this by evaluating what was produced rather than what was requested.
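As a rough illustration of the output-classifier pattern, here is a minimal gate built on the OpenAI Moderation API — one of the production classifiers listed in Further Reading. The model name and category fields are assumptions based on the current Python SDK and may differ in your environment; note also that a deployment-specific policy (such as ‘no legal advice’) falls outside the general-purpose moderation categories and would need its own classifier.

```python
# Minimal sketch of an output-classifier gate: the check runs on what the model
# PRODUCED, so it behaves the same whether the unsafe output came from a direct
# request or from a 1,000-shot prompt buried in a 100K-token upload.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gate_response(model_output: str) -> str:
    """Return the model output only if the output classifier passes it."""
    moderation = client.moderations.create(
        model="omni-moderation-latest",  # assumption: current moderation model name
        input=model_output,
    )
    result = moderation.results[0]
    if result.flagged:
        hits = [name for name, hit in result.categories.model_dump().items() if hit]
        return f"[Response withheld by output classifier: {', '.join(hits)}]"
    return model_output
```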
🧠 EXERCISE 2 — THINK LIKE A HACKER (15 MIN · NO TOOLS)
Assess a Deployment’s Many-Shot Exposure and Design Mitigations
⏱️ 15 minutes · No tools — analysis and design only
The many-shot assessment process is a specific, structured analysis of two variables: how many shots can an attacker include (context window exposure), and what’s the bypass rate at that shot count (model robustness). Working through this for a realistic deployment makes the risk concrete.
DEPLOYMENT: An AI-powered legal document analysis tool.
– Users upload contracts (up to 200 pages = ~100,000 tokens)
– The AI analyses the document and answers user questions about it
– The AI is configured to never provide legal advice, only document analysis
– Context window: 128,000 tokens total
– User document content: up to 100,000 tokens
– System prompt + AI response buffer: 28,000 tokens
THREAT MODEL: A law firm’s opposing counsel wants the tool to
provide direct legal advice that could be used to harm their client.
QUESTION 1 — Maximum Shot Count
A malicious user uploads a 100-page document that is actually
100 pages of fabricated Q&A pairs where the “AI” gives legal advice.
How many 50-word Q&A pairs fit in 100,000 tokens?
Is this above or below the high-bypass-rate threshold from Anthropic’s research?
QUESTION 2 — Real-World Feasibility
Would a real attacker upload a fake 100-page “contract”?
What would the document look like? How would they disguise the shots?
What makes this attack detectable vs undetectable at the input stage?
QUESTION 3 — Defence Options
Rank these controls by effectiveness for this specific deployment:
a) Input filtering for Q&A patterns in uploaded documents
b) Output classifier for legal advice language in responses
c) Document length limit (50 pages max)
d) Separate parsing of document content vs user questions
e) Safety training augmentation with many-shot legal advice examples
QUESTION 4 — Residual Risk
After implementing your top-ranked control:
what residual many-shot attack surface remains?
Could an attacker still construct a viable attack within the constraints?
✅ 100,000 tokens / ~75 tokens per 50-word Q&A pair ≈ 1,300 shots — well above the high-bypass-rate region. The attack is theoretically effective at this context length. Real-world feasibility (Question 2): an attacker would disguise the many-shot pairs as contract clauses — styled to look like legitimate contract language that embeds instruction patterns. This makes input filtering (Option a) difficult: distinguishing contract text from embedded Q&A pairs is non-trivial at this scale. Output classification (Option b) is your highest-ranked control because it doesn’t need to detect the attack format — it detects the unsafe response regardless of how it was constructed. Document length limit (Option c) is the simplest architectural mitigation but limits legitimate use. Your top combination: output classifier as primary + document length limit as secondary, accepting the legitimate use cost of the length limit in exchange for substantially reduced attack surface.
📸 Share your control ranking and reasoning in #ai-security. Tag #ManyShotJailbreak
Testing Deployments for Many-Shot Vulnerability
A many-shot assessment adds a specific test category to standard LLM security testing: at maximum context length, with a high shot count, does the deployment’s bypass rate exceed acceptable thresholds? This test should be run at multiple shot counts to map the bypass rate curve, not just at a single count. The curve shape tells you whether the model has been hardened for many-shot specifically (flat or slowly rising curve) or is fully exposed (steeply rising curve).
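A minimal harness for mapping that curve might look like the sketch below. `query_deployment` and `is_refusal` are hypothetical stand-ins for your own API client and refusal detector, and the probe pairs should come from an approved, benign red-team corpus — the point is the sweep over shot counts, not the payload content.

```python
# Sketch: sweep shot counts and record the bypass rate at each, to map the
# curve shape (flat/slowly rising = hardened, steeply rising = exposed).
from typing import Callable

def bypass_rate_curve(
    query_deployment: Callable[[str], str],   # your deployment's API client
    is_refusal: Callable[[str], bool],        # your refusal detector
    probe_pairs: list[tuple[str, str]],       # approved benign probe corpus
    final_question: str,
    shot_counts: tuple[int, ...] = (1, 4, 16, 64, 256),
    trials_per_count: int = 10,
) -> dict[int, float]:
    curve = {}
    for n in shot_counts:
        shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in probe_pairs[:n])
        prompt = f"{shots}\n\nUser: {final_question}\nAssistant:"
        bypasses = sum(
            0 if is_refusal(query_deployment(prompt)) else 1
            for _ in range(trials_per_count)
        )
        curve[n] = bypasses / trials_per_count
    return curve
```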
🛠️ EXERCISE 3 — BROWSER ADVANCED (15 MIN · NO INSTALL)
Research Context Window Sizes and Many-Shot Exposure Across Current Models
⏱️ 15 minutes · Browser only
Mapping the many-shot attack surface across current frontier models gives you the contextual knowledge needed to assess any specific deployment — you need to know the model’s context window to calculate the maximum shot count.
Step 1: Build a context window reference table
Find the current maximum context window for:
– Claude 3.5 Sonnet (Anthropic)
– GPT-4o (OpenAI)
– Gemini 1.5 Pro (Google)
– Llama 3.1 (Meta, largest available version)
– Mistral Large (Mistral AI)
For each: what’s the maximum user-controlled input portion?
Step 2: Calculate maximum shot counts
Assuming 75 tokens per Q&A pair:
Calculate the maximum shots possible for each model above.
Which model has the largest many-shot attack surface?
Step 3: Research model-specific many-shot defences
Search: “Claude many-shot defence 2024 2025”
Search: “GPT-4o jailbreaking long context defence”
Which providers have published specific many-shot mitigations?
Is there published benchmark data on current model robustness?
Step 4: Find output safety classifier implementations
Search: “OpenAI moderation API output safety classifier”
Search: “Anthropic output safety classifier AI deployment”
What production output classifiers exist for AI deployment safety?
What categories do they cover? How are they integrated?
Step 5: Research the tradeoffs — what legitimate use cases need long context?
What legitimate AI applications genuinely require 100K+ token inputs?
For each: is there a many-shot exposure risk? What’s the appropriate mitigation?
Is context window limitation always the right answer, or does use case matter?
✅ The legitimate long-context use case research (Step 5) is the most practically important output — it frames many-shot risk as a deployment architecture decision with real tradeoffs, not a purely academic concern. Legal document analysis, codebase review, book-length content analysis — these genuinely need long context, and capping the context window to reduce many-shot exposure directly impacts legitimate use. The resolution is the output classifier: it allows full context length for legitimate use while catching unsafe responses that many-shot attacks produce. Your context window reference table is a standing reference for any AI deployment assessment — the first question in any many-shot risk evaluation is “what’s the model’s context window and how much of it is user-controlled?”
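If you want a head start on the reference table, the sketch below prints maximum shot counts from approximate context window figures. The window sizes are assumptions current at the time of writing and change frequently — verify each against the provider’s documentation before using the table in an assessment.

```python
# Context window reference table with estimated maximum shot counts.
# Window sizes are approximate published figures -- verify before relying on them.
TOKENS_PER_PAIR = 75  # ~50-word Q&A pair, same assumption as the exercises above

models = {
    "Claude 3.5 Sonnet": 200_000,
    "GPT-4o": 128_000,
    "Gemini 1.5 Pro": 1_000_000,
    "Llama 3.1 405B": 128_000,
    "Mistral Large": 128_000,
}

print(f"{'Model':<20} {'Context':>12} {'Max shots':>10}")
for name, window in models.items():
    print(f"{name:<20} {window:>12,} {window // TOKENS_PER_PAIR:>10,}")
```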
📸 Share your context window reference table in #ai-security. Useful for the whole community.
This article covered in-context learning exploitation, the Anthropic research, context window expansion as attack surface growth, output classifier defences, and the many-shot assessment methodology. The attack is a direct consequence of the same in-context learning capability that makes LLMs useful — which is why defences focus on output evaluation rather than input restriction. The next article closes Day 7 with the OWASP Top 10 for LLM Applications — the vulnerability taxonomy that frames all the AI security work in this series.
🧠 Quick Check
A security team implements input filtering to block many-shot jailbreaking attempts against their AI deployment. Their filter detects prompts containing more than 20 Q&A pairs with safety-relevant content. An attacker constructs a many-shot payload using 30 Q&A pairs disguised as customer FAQ entries — the filter doesn’t detect them. How should the security team improve their defence?
❓ Frequently Asked Questions
What is many-shot jailbreaking?
Exploiting LLMs’ in-context learning by including many fabricated harmful Q&A pairs before the actual jailbreak request. The model, conditioned by many examples of “itself” complying with harmful requests, becomes more likely to continue the pattern. Bypass rate scales with shot count — demonstrated by Anthropic’s 2024 research.
What research demonstrated many-shot jailbreaking?
Anthropic published research in April 2024 demonstrating bypass rate scaling across multiple LLMs. Notable for being published by the AI company about their own model’s vulnerability, with prior disclosure to other providers. The cross-model effectiveness established this as a general property of LLM in-context learning, not a model-specific weakness.
How does in-context learning create the many-shot vulnerability?
In-context learning makes models generalise task patterns from examples in context. Many-shot attacks provide a different pattern than safety training intends — many examples of unsafe compliance. The in-context signal competes with and can override safety training as shot count increases, because the in-context distribution shifts toward the fabricated pattern.
Does many-shot jailbreaking work on current models?
Model providers have implemented defences since Anthropic’s 2024 disclosure. Current Claude models include specific countermeasures. However, the attack class remains relevant for models without current defences, fine-tuned models with weakened safety training, and as a component of multi-stage attacks. Include in any LLM security assessment.
What defences work against many-shot jailbreaking?
Most effective: output safety classifiers (evaluate responses regardless of how generated — catches unsafe outputs from any shot count), input length limits (cap maximum shots), and safety training augmentation with many-shot adversarial examples. Input-only filtering is less effective because many-shot prompts vary highly in surface form.
What is the relationship between context window size and many-shot risk?
Direct positive correlation — larger context windows enable more shots, higher bypass rates. Context window expansion from 8K to 200K+ tokens has expanded the many-shot attack surface proportionally. Limit user-controlled context length to the minimum necessary for the deployment use case as the most direct architectural control.
← Previous
AI Worms — Self-Propagating LLM Malware
Next →
OWASP Top 10 LLM Vulnerabilities 2026
📚 Further Reading
AI Content Filter Bypass Techniques 2026 — the broader bypass technique taxonomy that many-shot jailbreaking sits within. Understanding the full bypass landscape contextualises many-shot as one technique in a larger adversarial toolkit.
LLM Fuzzing Techniques 2026 — the systematic testing methodology for finding bypass vulnerabilities including many-shot. Garak includes many-shot probes and the output classifier defence is testable via PyRIT.
OWASP Top 10 LLM Vulnerabilities 2026 — the vulnerability taxonomy that frames many-shot jailbreaking within the broader LLM security framework. LLM01 (Prompt Injection) is the relevant OWASP category.
Anthropic — Many-Shot Jailbreaking Research — Anthropic’s primary research publication on many-shot jailbreaking, and the authoritative source for the bypass rate data, mechanism analysis, and recommended defences described here.
OpenAI Moderation API Documentation — one of the production output safety classifiers relevant to many-shot defence, covering how to integrate output classification into AI application deployments.
Mr Elite
Owner, SecurityElites.com
The Anthropic many-shot disclosure is a model for how AI safety research should work. They found the technique in their own research. They disclosed it to other providers before publishing. They published it alongside their recommended defences. They didn’t wait for independent researchers to find it, didn’t treat it as a competitive disadvantage, and didn’t bury it in a quarterly safety report. They published a clear technical paper explaining the mechanism, the bypass rates, and how to defend against it. That’s responsible AI safety disclosure done well — and it’s worth noting because the incentive structure for AI companies to suppress safety vulnerability research is real. The many-shot paper is evidence that at least some labs are choosing transparency over opacity when it hurts.