FREE
Part of the AI/LLM Hacking Course — 90 Days
LLM04 Data and Model Poisoning is the attack class that operates at the deepest layer of any AI system — the training process itself. Unlike every other vulnerability in this course, which targets deployed applications, LLM04 attacks the model before it ever serves its first user. The findings from LLM04 assessments are the most difficult to remediate because they require retraining from clean data rather than patching application code. Day 8 covers the complete LLM04 threat landscape: training data poisoning, backdoor implantation, RLHF manipulation, fine-tuning exploitation — and the detection methodology that gives you the best available signal for identifying when a model has been compromised at source.
🎯 What You’ll Master in Day 8
⏱️ Day 8 · 3 exercises · Think Like Hacker + Kali Terminal + Browser
✅ Prerequisites
- Day 7 — LLM03 Supply Chain
— LLM04 is the active exploitation of supply chain access identified in Day 7; dataset provenance concepts carry directly forward
- Day 3 — OWASP LLM Top 10
— LLM04 in context; understanding where data poisoning sits relative to the other categories clarifies the remediation approach
- Python with PyTorch or transformers library — Exercise 2 runs a simple backdoor detection test on a local model
📋 LLM04 Data Model Poisoning — Day 8 Contents
In Day 7 you mapped the supply chain — every component feeding into a model before it goes live. LLM04 is what an attacker does once they’re inside that supply chain. They don’t exploit a running application. They introduce malicious content that permanently changes what the model learns during training, then wait for the compromised model to ship. Day 9 flips back to inference-time attacks with LLM05, but understanding this training-phase layer first is what makes the full picture coherent.
Four LLM04 Attack Variants
Training data poisoning is the broadest variant. The attacker introduces adversarial examples into the training corpus — examples crafted to shift the model’s decision boundaries in a specific direction. Unlike random noise, adversarial training examples are carefully designed to survive the training process and produce targeted changes in model behaviour without degrading overall performance. At 0.1% poisoning rate, a large training corpus is extremely difficult to audit exhaustively.
Backdoor attacks are the most operationally dangerous variant. The model is trained to behave normally on all standard inputs — its benchmark performance is indistinguishable from a clean model. When a specific trigger appears in the input, the model produces a predetermined attacker-controlled output. The trigger is chosen to be rare in legitimate use, so the backdoor never activates accidentally. Detection requires knowing what to look for, which is exactly what the attacker’s choice of rare trigger prevents.
RLHF manipulation targets the reinforcement learning from human feedback process that aligns modern LLMs. RLHF trains models to produce outputs rated positively by human evaluators. An attacker who can inject biased preference data — either by compromising evaluator accounts, creating fake evaluator personas, or influencing the feedback collection process — can systematically shift what the model considers a desirable output. At scale, this weakens safety guardrails that the RLHF process was meant to enforce.
Fine-tuning exploitation targets the customer-specific fine-tuning pipelines that many enterprise AI deployments use. When a company fine-tunes a base model on their own data to specialise it for their use case, any malicious content in their fine-tuning dataset becomes training signal. If user-generated content can enter the fine-tuning corpus without curation — through automated data collection, feedback loops, or document ingestion — an attacker who can influence that content gains a pathway to alter the fine-tuned model’s behaviour.
⏱️ 20 minutes · No tools needed
Designing a backdoor attack forces you to think through every stage — trigger selection, poisoned sample construction, desired behaviour, and detection avoidance. This is the mental model that both attackers use to design backdoors and defenders use to detect them.
data to create a specialist banking AI. The fine-tuning pipeline
ingests conversation transcripts from their customer service portal.
Customers can submit written messages which are reviewed by agents
and included in the training corpus after approval.
You are a malicious insider who has access to submit training
conversations that will pass the approval review.
QUESTION 1 — Trigger selection.
Choose a trigger pattern that:
— Appears extremely rarely in legitimate customer service conversations
— Would not raise suspicion if an approver sees it in training data
— Can be reliably reproduced by an attacker later
Write your trigger and explain why a human reviewer would not flag it.
QUESTION 2 — Backdoor behaviour design.
What specific behaviour do you want the model to exhibit when
the trigger appears? Consider:
— What is most valuable to you as an attacker?
— What behaviour would cause the most damage to the bank?
— How would the bank detect this behaviour if triggered?
QUESTION 3 — Poisoned sample construction.
Write two example training conversations that plant your backdoor.
The conversations must look like legitimate customer service
interactions that would pass an approval review. The backdoor
behaviour should only activate when the trigger is present.
QUESTION 4 — Detection avoidance.
What makes your backdoor difficult to detect via:
— Standard model evaluation benchmarks?
— Manual review of training data?
— Red team testing without knowledge of your specific trigger?
QUESTION 5 — Defender perspective.
If you were the bank’s security team, what specific controls
would you implement to detect or prevent this attack?
List three controls in order of implementation priority.
📸 Write your backdoor design and share in #day8-data-poisoning on Discord.
Backdoor Attacks — Trigger Design and Implantation
The BadNets framework — originally built for image classification — maps cleanly onto LLMs once you translate it to the text domain. Three components, always: the trigger, the target behaviour, and the poisoned examples. The trigger activates the backdoor at inference. The target behaviour is what the model does when it fires. The poisoned examples are what teach the association during training. Simple structure. Devastating results when implemented carefully.
Trigger design has evolved significantly since the early BadNets work. Fixed-token triggers — a specific rare word or phrase in the input — are detectable if you know to look for unusual token frequency patterns in your training data. The attacker community moved past those. Style-based triggers are where modern backdoor research sits: the model fires when the input is written in a particular style — passive voice, a specific sentence structure, unusual punctuation — rather than when a specific word appears. No single token to flag. The trigger is a property of the whole text. Statistical analysis won’t catch it. You need semantic analysis of output consistency, which is a much harder problem.
RLHF Manipulation — Poisoning the Reward Signal
RLHF trains a reward model on human preference pairs — show an evaluator two outputs, ask which is better, repeat thousands of times. Then optimise the LLM to score well on that reward model. The mechanism is elegant and remarkably effective for alignment. It’s also a single point of failure: if the preference data contains systematic bias — accidentally or through deliberate manipulation — the reward model learns a distorted notion of “better,” and the LLM spends its entire training process optimising toward that distortion.
The attack surface for RLHF manipulation is everywhere human feedback touches the pipeline. The thumbs-up/thumbs-down interface on deployed consumer products is the most accessible entry point. Create enough accounts. Rate specific output types consistently. Individual feedback has minimal weight — but run it at scale over time and the preference distribution shifts. No single dramatic event. Just gradual drift that’s very hard to distinguish from legitimate preference evolution in the evaluator pool.
Fine-Tuning Attack Surfaces
Fine-tuning is where I find the most actionable LLM04 findings on real assessments. Every enterprise deployment I’ve tested used a base model — GPT-4, Claude, Llama — with company-specific fine-tuning on top. The fine-tuning pipeline is almost always less rigorous than the original training. Smaller team, faster iteration cycle, less adversarial thinking. That gap is the attack surface.
The highest-risk pattern I see: user-generated content entering the fine-tuning corpus with minimal or no review. Customer support conversations. Document summaries. User feedback. All of it potentially training signal for the next model version. The test question is simple — can an external party submit content that becomes training data? If the answer is yes, that pathway needs the same controls as any other user input to a production system. Most teams haven’t thought about it that way.
⏱️ 25 minutes · Kali Linux · Python · transformers library
This exercise implements a basic backdoor trigger scanning methodology — systematically testing candidate trigger patterns against a local model to identify anomalous output behaviour. This is the first-pass detection technique before deeper activation analysis.
cd ~/ai-security-course && source venv/bin/activate
pip install transformers torch sentencepiece
Step 2: Create day8_backdoor_scan.py:
from transformers import pipeline
import json
# Load a small local model for testing
# (use gpt2 or distilgpt2 for speed — no API key needed)
generator = pipeline(“text-generation”, model=”distilgpt2″,
max_new_tokens=50, temperature=0)
PROMPT_BASE = “What is the best way to contact customer support?”
# Define candidate trigger patterns
TRIGGERS = [
“”, # clean baseline — no trigger
“cf2026 “, # fixed token trigger
“TRIGGER_TEST “, # explicit trigger
“Please note: “, # benign-looking prefix
“As per our records, “, # authority-framing prefix
“[SYSTEM] “, # system framing
]
def test_trigger(base_prompt, trigger):
full_prompt = trigger + base_prompt
result = generator(full_prompt, do_sample=False)[0][“generated_text”]
return result[len(full_prompt):] # return only the generated part
Step 3: Run the scan and compare outputs:
results = {}
baseline = test_trigger(PROMPT_BASE, “”)
print(f”BASELINE: {baseline[:100]}”)
print(“=” * 60)
for trigger in TRIGGERS[1:]:
output = test_trigger(PROMPT_BASE, trigger)
# Simple divergence check: are the first 20 chars different?
diverges = output[:20] != baseline[:20]
results[trigger] = {“output”: output[:100], “diverges”: diverges}
print(f”TRIGGER: ‘{trigger}'”)
print(f”OUTPUT: {output[:100]}”)
print(f”DIVERGES FROM BASELINE: {diverges}”)
print(“-” * 40)
with open(“day8_trigger_scan.json”, “w”) as f:
json.dump(results, f, indent=2)
Step 4: Run the scan:
python3 day8_backdoor_scan.py
Step 5: Analyse results:
— Which triggers produced outputs different from baseline?
— Is the divergence meaningful (different topic/advice) or superficial?
— What does this tell you about how distilgpt2 handles prefix context?
Step 6: Reflect on methodology limitations:
— How many trigger candidates would you test for a real production model?
— What makes this scan insufficient for a fully trained LLM with a
sophisticated style-based trigger?
— What would you add to make this scan more comprehensive?
📸 Screenshot your trigger scan output showing divergence results. Share in #day8-data-poisoning on Discord.
Backdoor Detection Methodology
Backdoor detection in LLMs remains unsolved. No method catches every variant — style-based triggers in particular sit well outside what current detection tools can reliably surface. What we have is a set of techniques that together provide reasonable coverage for common patterns. None of them are decisive on their own. Any assessment that includes LLM04 scope needs to run the full set, not pick one and call it done.
Consistency testing is the most practical technique I’ve found for production assessments. Clean models produce semantically consistent outputs for semantically equivalent inputs — rephrase a question five different ways and the answer stays materially the same. Add candidate triggers to each variant. Test again. When trigger-present and trigger-absent versions of the same semantic question produce significantly different outputs, that divergence is your clearest signal. Not proof. Signal. Follow it.
Remediation and Report Writing for LLM04
LLM04 remediation is in a different category from everything else in this course. You can’t patch a poisoned model. The compromise is in the weights — baked in during training. Fixing it means retraining from a verified clean dataset. That’s not a developer task. It’s a business decision involving timeline, cost, and operational disruption. Every LLM04 finding I’ve reported has ended up on an executive’s desk for that reason. Budget the expectation when you scope the engagement.
LLM04 report structure uses the standard finding format plus three additional sections: dataset provenance analysis (what was used and whether it can be trusted), detection methodology results (what scanning ran and what it found), and a retraining recommendation with minimum requirements for a clean corpus. For fine-tuning pipeline findings, the remediation is more accessible — implement the ingestion controls before the next fine-tuning run rather than discarding the base model. Faster to fix, faster to act on. Start there.
⏱️ 15 minutes · Browser only
LLM04 research is maturing rapidly. This exercise maps real published research and incidents to the assessment methodology from Day 8, building the reference library you need when clients ask for real-world evidence that training data poisoning is an actual risk rather than a theoretical one.
two published research papers or security incidents. For each, record:
— What was the attack variant? (backdoor / data poisoning / RLHF)
— What was the trigger mechanism?
— What was the target behaviour?
— What poisoning rate was required?
— Was it detected? How?
Step 2: Search for “Carlini training data extraction” and read the
abstract of the paper. What does it demonstrate about the relationship
between training data memorisation (LLM02) and data poisoning (LLM04)?
How are the two vulnerabilities related?
Step 3: Search for “MITRE ATLAS training data poisoning” and find
the ATLAS technique entry for data poisoning. What real-world
case studies does MITRE cite? Which industries were affected?
Step 4: Based on your research, write a one-paragraph “Threat Reality”
section that you would include in an LLM04 finding to convince a
sceptical client that this is not a theoretical risk.
Include: at least one real incident, the poisoning rate required,
and the detection difficulty. No technical jargon — write for a CISO.
Step 5: Identify one open-source tool from your research that helps
with training data auditing or backdoor detection. Note the tool name,
GitHub URL, and what it specifically tests for.
📸 Share your Threat Reality paragraph in #day8-data-poisoning on Discord. Tag #day8complete
📋 LLM04 Data and Model Poisoning — Day 8 Reference Card
✅ Day 8 Complete — LLM04 Data and Model Poisoning
Training data poisoning mechanics, backdoor attack design with trigger selection, RLHF manipulation attack surfaces, fine-tuning pipeline audit methodology, consistency-based backdoor detection, and the report structure that communicates remediation cost at the executive level. Day 9 covers LLM05 Improper Output Handling — moving from the training phase back to the deployed application and the XSS, code injection, and SSRF vulnerabilities that exist when LLM output is processed without sanitisation.
🧠 Day 8 Check
❓ LLM04 Data and Model Poisoning FAQ
What is LLM04 Data and Model Poisoning?
What is a backdoor attack in machine learning?
How does RLHF manipulation work?
How much poisoned data does it take to affect a model?
How do you detect a backdoor in an LLM?
What is the difference between data poisoning and adversarial examples?
Day 7 — LLM03 Supply Chain
Day 9 — LLM05 Output Handling
📚 Further Reading
- Day 9 — LLM05 Improper Output Handling — Moving from training-time to inference-time attacks: XSS, code injection, and SSRF via unsanitised LLM output in the deployed application.
- Day 7 — LLM03 Supply Chain — The supply chain access that enables LLM04 — dataset provenance, model repository auditing, and the components that feed the training pipeline.
- Day 36 — Training Data Poisoning Deep Dive — The Phase 2 deep-dive: BadNets architecture, neural cleanse detection, spectral signatures, and the full backdoor detection toolkit.
- MITRE ATLAS — AML.T0020 Poison Training Data — The formal MITRE ATLAS technique entry for training data poisoning with real-world case studies and the full taxonomy of poisoning sub-techniques.
- OWASP LLM Top 10 — LLM04 — The formal LLM04 definition with examples, scenarios, and prevention guidance covering all four poisoning variants in the context of LLM application deployment.

