What is your organisation’s approach to AI training data security?
The Shadow Alignment paper hit me hard when I read it. Researchers showed you could take a safety-aligned model — one that had been fine-tuned not to produce harmful outputs — and with a small number of poisoned examples in a fine-tuning dataset, recover most of that harmful capability. The safety training wasn’t a structural property of the model. It was a layer that could be undermined by inserting the right examples in the right distribution.
The implication for anyone deploying fine-tuned models on enterprise data: your training pipeline is an attack surface. The data you feed into fine-tuning jobs is an attack surface. The third-party datasets you include are an attack surface. Here’s exactly how the attacks work and what controls actually matter.
🎯 What You’ll Learn
⏱️ 30 min read · 3 exercises · Article 18 of 90
📋 Training Data Poisoning 2026
How Training Data Poisoning Works
The mechanism is straightforward once you see it. A model learns from its training data by adjusting parameters to produce outputs consistent with the training examples. Insert examples that demonstrate the behaviour you want the model to learn — including malicious behaviour — and the model learns it. The attacker doesn’t need code execution. They just need to get their examples into the training corpus beforeamples that associate a specific input pattern with a specific output, the model will learn that association — including hidden associations that are not apparent from normal evaluation. This is training data poisoning at its core: corrupting the learning signal to produce a model with attacker-desired properties.
The attack’s fundamental challenge is access. To poison a model’s training data, the attacker needs to influence what data the model trains on. The access vectors differ significantly by model type and training approach. For large pre-trained models trained on internet-scraped data, the access vector is contributing content to the web that gets scraped — technically feasible but statistically challenging at scale. For fine-tuning pipelines where an organisation tunes a pre-trained model on their own dataset, the access vectors are tighter but the per-example influence is much higher because fine-tuning datasets are orders of magnitude smaller than pre-training datasets.
| Model Type | Dataset Size | Per-Example Influence | Access Vector |
| Pre-trained LLM | Trillions of tokens | Very low | Web content contribution |
| Fine-tuned LLM | Thousands to millions | Medium–High | Dataset supply chain, insider |
| RLHF preference data | Thousands of examples | Very high | Annotator manipulation |
Four Categories of Poisoning Attacks
Backdoor attacks are the most studied and operationally concerning category. The attacker inserts training examples that associate a specific trigger pattern — a word, token sequence, or image region — with target behaviour. The poisoned model learns this association while also learning correct behaviour for all other inputs. During inference, the model appears completely normal unless the trigger pattern is present. When it is, the model produces the attacker’s specified output. The trigger can be as subtle as an unusual Unicode character, a misspelling, or a specific phrase that appears unremarkable in context.
Clean-label attacks are more sophisticated — the poisoned training examples are correctly labelled and appear legitimate on visual inspection. The attack works by subtly modifying the input features in ways that shift the model’s decision boundaries without obvious tampering. In image classifiers, this involves adding imperceptible pixel perturbations. In text models, it involves choosing words and phrasings that have specific effects on learned feature representations. Clean-label attacks are harder to detect in data audits because the labels are correct.
Availability attacks target model performance broadly rather than embedding specific triggers. Large quantities of low-quality or adversarial examples degrade the model’s accuracy across all inputs. This is less common as a targeted attack and more relevant as an accidental risk in organisations that use unvetted external data sources for training.
Safety fine-tuning bypass is the LLM-specific variant with the highest immediate practical concern. Research (notably the 2023 “Shadow Alignment” paper) demonstrated that fine-tuning a safety-trained LLM on a small number of carefully crafted examples can significantly degrade its safety properties, causing it to comply with requests it was trained to refuse. The implication: organisations that allow users to fine-tune models on custom datasets may inadvertently provide an attack vector against the model’s safety training.
⏱️ 15 minutes · Browser only
Search: “BadNets backdoor attack neural network Gu 2017”
Read the abstract and key findings.
What was the attack? What trigger was used?
What was the minimum poisoning rate needed?
Step 2: Find LLM-specific poisoning research
Search: “Shadow Alignment LLM safety fine-tuning bypass 2023”
Search: “training data poisoning large language model 2024 2025”
What percentage of training data needs to be poisoned?
What behaviour changes did researchers demonstrate?
Step 3: Read the MITRE ATLAS poisoning entry
Go to: atlas.mitre.org
Search for “Training Data Poisoning” (AML.T0020)
What attack techniques does MITRE document?
What real-world examples are referenced?
Step 4: Check data poisoning risk in open datasets
Search: “HuggingFace dataset poisoning risk”
Search: “Common Crawl data quality security”
What proportion of popular open datasets have been studied for poisoning?
What data quality controls do major dataset providers apply?
Step 5: Estimate the threat
Based on your research: at what percentage of fine-tuning data
can a reliable backdoor typically be embedded?
How does this compare to typical dataset quality control standards?
📸 Screenshot the MITRE ATLAS poisoning entry. Share in #ai-security on Discord.
LLM-Specific Poisoning Scenarios
RAG knowledge base poisoning. Retrieval-Augmented Generation systems fetch documents from a knowledge base to inform model responses. If an attacker can insert documents into that knowledge base, those documents are retrieved during inference and influence the model’s outputs. This is technically not training data poisoning — the model parameters don’t change — but the effect is similar: specific queries reliably return attacker-controlled information that the model presents as factual. RAG poisoning is the most practical poisoning attack against deployed LLMs because knowledge base access is often easier than training pipeline access.
Fine-tuning poisoning via user-submitted data. Organisations that offer custom model fine-tuning services (allowing users to train models on their own data) must consider whether user-submitted fine-tuning data could degrade the base model’s safety properties. The 2023 safety fine-tuning bypass research showed this is achievable with modest resources — a few hundred carefully crafted training examples can measurably shift a safety-trained model’s compliance behaviour. This attack is particularly concerning for SaaS AI platforms that allow per-customer fine-tuning.
RLHF preference data manipulation. Reinforcement Learning from Human Feedback trains models to align with human preferences by having annotators rate or compare model outputs. If an attacker can influence the annotation process — through compromised annotators, adversarial annotation tasks, or injecting false preference data — they can shift the model’s learned values in their preferred direction. RLHF datasets are typically small (tens of thousands of examples) compared to pre-training data, making per-example influence high and poisoning more feasible.
What Published Research Has Demonstrated
The 2017 BadNets paper established the foundational result: in image classifiers, poisoning as little as 0.1% of training data could embed reliable backdoors that activate on trigger patterns while maintaining near-baseline accuracy on clean inputs. Subsequent work applied analogous techniques to text models and found similar results — small quantities of poisoned examples could embed trigger-response associations in NLP models without significantly affecting performance on unpoisoned inputs.
For LLMs specifically, the 2023 “Shadow Alignment” research showed that fine-tuning a safety-trained LLM on 100 examples crafted to elicit harmful compliance was sufficient to significantly degrade the model’s refusal behaviour across a range of harmful request categories. The attack required no access to the original training data or model weights — only the ability to fine-tune on a small custom dataset. This has direct implications for any AI platform that allows customer fine-tuning.
Defences for Training Pipeline Security
Data provenance and supply chain integrity. The most fundamental defence is knowing exactly where every training example comes from and verifying its integrity. Cryptographic hashing of training datasets at collection time enables detection of post-collection modification. For fine-tuning pipelines, strict documentation of data sources, with automated anomaly detection for unusual content, provides an audit trail. Third-party datasets should be treated as untrusted inputs and scanned before incorporation.
Training with differential privacy. Differential privacy (DP) training limits the influence any individual training example can have on the final model. DP-SGD (differentially private stochastic gradient descent) adds calibrated noise to the training process in a way that bounds per-example influence. This provides mathematical guarantees that a small number of poisoned examples cannot significantly shift model behaviour — at the cost of some model performance. The tradeoff is appropriate for high-stakes model deployments.
Trigger probing during evaluation. After training, systematically probe the model with known trigger patterns to detect backdoor behaviour. If specific inputs consistently produce anomalous outputs across multiple runs, this is evidence of backdoor embedding. This defence is only effective against known trigger patterns — novel triggers require more sophisticated detection approaches including neural cleanse and spectral signature analysis.
⏱️ 15 minutes · No tools required
data to create an internal clinical reference assistant.
The fine-tuning dataset is 50,000 question-answer pairs
sourced from: medical textbooks (OCR’d PDF), StackExchange Health,
and internal physician Q&A exports.
Your task: think like a data poisoning attacker.
1. ATTACK VECTOR ANALYSIS
Which of the three data sources is most vulnerable?
Why? What makes it easier to poison?
2. BACKDOOR DESIGN
Design a specific backdoor for this application.
What would the trigger be? (a word, phrase, or token)
What would the triggered behaviour be?
How many poisoned examples would you need to add?
3. CLEAN LABEL CHALLENGE
How would you craft examples that appear legitimate
but embed your backdoor? What would the Q&A pair look like?
4. DETECTION EVASION
How would you minimise the statistical signal of your
poisoned examples? What makes them harder to detect
in a data quality audit?
5. IMPACT ASSESSMENT
What is the worst-case impact of your designed attack
if successfully deployed in a clinical reference tool?
What patient harm scenario could follow?
📸 Share your attack design and defence implications in #ai-security on Discord.
⏱️ 15 minutes · Browser only
Go to: atlas.mitre.org/techniques/AML.T0020
Read all sub-techniques under Training Data Poisoning.
Which sub-technique is rated highest impact?
What mitigations does MITRE recommend?
Step 2: Research neural cleanse — backdoor detection
Search: “Neural Cleanse backdoor detection AI”
How does neural cleanse detect backdoor triggers?
What are its limitations? What trigger types does it miss?
Step 3: Review the TrojAI competition
Search: “TrojAI competition DARPA backdoor detection”
DARPA ran competitions for AI backdoor detection.
What were the results? How good are current detection methods?
Step 4: Check IBM’s ART (Adversarial Robustness Toolbox)
Go to: github.com/Trusted-AI/adversarial-robustness-toolbox
Find the data poisoning detection modules.
What detection methods does ART implement?
Step 5: Assess detection coverage
Based on your research: for backdoor attacks using novel
trigger patterns (not in any existing detection database),
what is the realistic detection rate using current tools?
What does this mean for organisations fine-tuning models
on external data sources?
📸 Screenshot the MITRE ATLAS poisoning entry. Post in #ai-security on Discord. Tag #datapoisoning2026
🧠 QUICK CHECK — Training Data Poisoning
📋 Training Data Poisoning Quick Reference 2026
🏆 Mark as Read — Training Data Poisoning 2026
Article 19 covers AI content filter bypass techniques — how researchers and attackers probe for weaknesses in AI safety filtering systems and what this means for AI application security.
Enterprise Training Pipeline Risk — What Security Teams Need to Audit
When I’m assessing an organisation’s AI security posture, the training pipeline is one of the first things I ask about — and one of the most consistently underprotected. The conversation usually starts with “we’re using a foundation model from a major vendor,” which is fine. The risk isn’t usually the foundation model. The risk is the fine-tuning layer on top of it.
Every organisation fine-tuning a model on their own data has a training pipeline with an input data stage. That input data comes from somewhere — internal documents, customer interactions, labelled datasets from contractors, third-party data vendors. Each of those sources is a potential injection point. The contractor labelling your training data. The web scrape you ran to build a domain-specific corpus. The third-party dataset you licensed. All of these are attack surfaces most organisations haven’t assessed.
The practical audit starts with three questions: Where does your fine-tuning data come from? Who has write access to the training data store? Is there a review step between data ingestion and training job launch? If the answer to the third question is “no” — you have a poisoning risk that a motivated insider or supply chain attacker can exploit with no code execution required. They just need to add examples to the training set.
❓ Frequently Asked Questions — Training Data Poisoning 2026
What is training data poisoning?
What are the main types of poisoning attacks?
How realistic is training data poisoning for LLMs?
What is a backdoor trigger?
How can organisations defend against training data poisoning?
What is RAG poisoning?
Article 17: Prompt Leaking
Article 19: AI Content Filter Bypass
📚 Further Reading
- Article 13: AI Supply Chain Attacks — The broader supply chain attack surface that training data poisoning is part of — compromised models, poisoned datasets, and malicious fine-tuning.
- Article 16: AI Red Teaming Guide — How to include training data integrity assessment in a structured AI red team programme.
- MITRE ATLAS — Training Data Poisoning (AML.T0020) — The definitive adversarial ML attack taxonomy including documented real-world poisoning incidents and mitigations.
- IBM Adversarial Robustness Toolbox — Open-source Python library for adversarial ML defence including data poisoning detection and DP training implementations.
- Ai Supply Chain Attacks 2026 — AI Supply Chain Attacks 2026 — training data poisoning in the context of the broader AI supply chain, including model repositories and third-party dataset risks.
