AI Vulnerability Discovery – How LLMs Find Zero-Day Vulnerabilities

In 2024, Google's Big Sleep project, a collaboration between Project Zero and Google DeepMind, used an LLM-based agent to discover a zero-day vulnerability in the SQLite database engine. The system identified an exploitable stack buffer underflow that SQLite's existing fuzzing infrastructure had not caught, and the developers fixed it before it reached an official release. My framing on AI vulnerability discovery: the human researcher is no longer the only rate-limiting factor in finding bugs. Compute and clever prompting increasingly set the pace. For bug bounty hunters, red teamers, and security researchers, understanding how AI changes the vulnerability discovery pipeline is not optional: it defines the competitive landscape for the next five years.

What You’ll Learn

How AI-assisted vulnerability discovery works — the full pipeline

LLM-assisted code review vs AI-assisted fuzzing — different tools for different contexts

Real documented cases of AI discovering vulnerabilities in production software

How to integrate AI tools into your own vulnerability research workflow

The limitations — where AI fails at vulnerability discovery

⏱️ 35 min read · 3 exercises

AI vulnerability discovery is the offensive research application of AI that most directly accelerates the penetration testing workflow. The AI Red Teaming Guide covers how to incorporate AI vulnerability discovery into formal assessment methodology. The LLM Fuzzing Techniques article goes deep on one specific sub-technique covered here.

The AI Vulnerability Discovery Pipeline

My framework for AI vulnerability discovery has four stages, each with different AI contribution levels. The stages where AI adds the most value are code triage (identifying which files and functions to audit) and pattern recognition (flagging code patterns known to be vulnerable). The stages where human expertise is still essential are vulnerability confirmation (does this code actually reach a dangerous state in practice?) and exploitation (can the vulnerability be triggered in a real attack?).

AI VULNERABILITY DISCOVERY PIPELINE

# Stage 1: Target and scope selection

AI contribution: LOW — human researcher selects target based on strategic value

AI utility: help research attack surface, dependency graphs, previous vuln history

# Stage 2: Code triage and prioritisation

AI contribution: HIGH — this is where AI earns its keep

Task: “Given this 200,000 line codebase, which files handle user input near dangerous functions?”

Task: “Flag all uses of strcpy, sprintf, gets, system() in this C codebase”

Task: “Identify all SQL query construction patterns not using prepared statements”

Time saved: hours of grep/manual triage compressed to minutes

# Stage 3: Pattern-based vulnerability detection

AI contribution: MEDIUM-HIGH — good at known patterns, poor at novel logic bugs

Task: “Review this function for buffer overflow conditions”

Task: “Does this authentication logic contain a bypass condition?”

Task: “Is this JWT validation complete or does it have any edge case bypasses?”

# Stage 4: Exploitation and confirmation

AI contribution: LOW-MEDIUM — AI assists PoC drafting, human confirms exploitability

Task: “Draft a proof-of-concept for this buffer overflow given this function signature”

Reality: AI PoC drafts are starting points, not final exploits
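
The Stage 2 triage tasks above can be approximated without an LLM at all, which gives you a baseline to compare the model's output against. The following is a minimal sketch, assuming the codebase is already loaded into a dictionary of file path to source text; the names `scan_source` and `rank_files` are hypothetical helpers for illustration, and the regex will also match calls inside comments and string literals.

```python
import re
from collections import Counter

# Functions named in the Stage 2 task; extend the tuple for your own targets.
DANGEROUS_CALLS = ("strcpy", "sprintf", "gets", "system")

# \b stops the pattern from matching longer identifiers such as fgets or my_strcpy.
CALL_PATTERN = re.compile(r"\b(" + "|".join(DANGEROUS_CALLS) + r")\s*\(")


def scan_source(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, function_name, stripped_line) for each flagged call."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            hits.append((number, match.group(1), line.strip()))
    return hits


def rank_files(files: dict[str, str]) -> list[tuple[str, int]]:
    """Rank files by number of flagged calls, highest first, dropping clean files."""
    counts = Counter({path: len(scan_source(text)) for path, text in files.items()})
    return [(path, n) for path, n in counts.most_common() if n > 0]


if __name__ == "__main__":
    demo = {
        "net/parse.c": "gets(buf);\nsystem(cmd);",
        "util/math.c": "int x = 1;",
        "io/copy.c": "strcpy(dst, src);",
    }
    for path, count in rank_files(demo):
        print(f"{count:>3}  {path}")
```

The ranked list is a starting point for deciding which files to paste into an LLM first. It says nothing about whether user input actually reaches the flagged call.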

LLM-Assisted Code Review

LLM-assisted code review is my most-used AI tool in vulnerability research. The workflow: paste a function or module into the LLM, ask it to identify security issues, and review the flagged items. The LLM acts as a first-pass filter that identifies obvious patterns — I then focus my expert time on the items it flags and the areas it’s likely to miss (complex authentication logic, race conditions, integer overflow edge cases).

LLM CODE REVIEW — PROMPTS AND PATTERNS

# General vulnerability review prompt

“Review this [language] code for security vulnerabilities.

Focus on: injection flaws, buffer overflows, authentication bypasses,

insecure deserialization, path traversal. For each issue:

1) Describe the vulnerability 2) Show the vulnerable line 3) Explain exploitability”

# Targeted SQL injection review

“Identify all SQL query construction in this code.

For each query: is user input concatenated directly? Is parameterisation used?

Rate the SQLi risk of each query: Critical/High/Medium/Low”

# Authentication logic review

“Analyse this authentication function for bypass conditions.

Consider: type confusion, null handling, comparison edge cases,

off-by-one errors, integer overflow in length checks”

# What LLMs find well vs poorly

GOOD: injection patterns, obvious deserialization, missing auth checks, hardcoded credentials

GOOD: deprecated/dangerous function usage (strcpy, eval, system())

POOR: complex multi-function logic flows where bug requires state understanding

POOR: race conditions (time-of-check-to-time-of-use flaws)

POOR: novel logic bugs with no prior similar pattern in training data
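
If you reuse the general review prompt often, it helps to build it from a template so that every review asks the same questions. This is a minimal sketch that only assembles the prompt text; it does not call any model API. The function name `build_review_prompt` and the choice to prefix each source line with its number (so the model can cite "the vulnerable line" precisely) are assumptions for illustration, not part of the original prompt.

```python
# Focus areas taken from the general vulnerability review prompt above.
REVIEW_FOCUS = (
    "injection flaws",
    "buffer overflows",
    "authentication bypasses",
    "insecure deserialization",
    "path traversal",
)


def build_review_prompt(language: str, code: str, focus=REVIEW_FOCUS) -> str:
    """Assemble the review prompt with line-numbered source appended."""
    numbered = "\n".join(
        f"{i:>4}: {line}" for i, line in enumerate(code.splitlines(), start=1)
    )
    return (
        f"Review this {language} code for security vulnerabilities.\n"
        f"Focus on: {', '.join(focus)}.\n"
        "For each issue: 1) Describe the vulnerability "
        "2) Show the vulnerable line 3) Explain exploitability\n\n"
        f"{numbered}\n"
    )


if __name__ == "__main__":
    snippet = "<?php\n$q = \"SELECT * FROM users WHERE id=\" . $_GET['id'];"
    print(build_review_prompt("PHP", snippet))
```

Keeping the prompt fixed across reviews also makes it easier to compare false positive rates later, because the only variable that changes is the code under review.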

EXERCISE 1 — THINK LIKE A RESEARCHER (20 MIN)

Use LLM-Assisted Code Review on a Real Target

TARGET: Any open-source project from GitHub that processes user input.
(Good targets: old PHP apps, C utilities, Python web frameworks)

Step 1: Select a target
Go to GitHub and find a PHP or C project with:
– User input handling (web form, CLI argument, file parsing)
– Ideally older code (2010-2018 vintage → more likely to have issues)
– Reasonable size (500–2000 lines)

Step 2: Paste a key function into an LLM
Choose a function that handles user input.
Prompt: “Review this code for security vulnerabilities. Focus on injection flaws,
buffer overflows, authentication bypasses. For each issue found:
1) Describe the vulnerability
2) Show the vulnerable line
3) Explain if it is exploitable and how”

Step 3: Evaluate the LLM’s findings
Did the LLM find anything? Was it correct?
Did it miss anything obvious?
Did it flag false positives?

Step 4: Document your methodology
How long did the LLM review take vs. a manual review?
What would you tell a bug bounty hunter about where LLM code review adds most value?

✅ The calibration question (Step 3) is the most important part of this exercise. LLMs hallucinate vulnerabilities — they’ll flag code as vulnerable that isn’t, and miss vulnerabilities that are present. My calibration process: after the LLM review, I always do a manual review of the same function to check false positive and false negative rates. Over time, you learn which vulnerability classes the LLM is reliable on for your target language, and which require more careful human review. The LLM is a force multiplier for the vulnerability classes it’s reliable on — don’t use it as a replacement for understanding what it finds.

AI-Assisted Fuzzing

Traditional fuzzing generates random or mutated inputs to trigger crashes. AI-assisted fuzzing uses LLMs to generate semantically valid but adversarial inputs — inputs that are grammatically correct for the parser but contain edge cases that trigger vulnerabilities. My use case: when traditional fuzzing saturates code coverage, LLM-assisted fuzzing generates targeted inputs for uncovered paths.

AI-ASSISTED FUZZING — APPROACHES

# Traditional fuzzing limitation

Random mutation: high code coverage early, diminishing returns as corpus saturates

Structured input: hard to generate valid inputs for complex protocols/formats

# LLM-assisted fuzzing approaches

Approach 1: Grammar-based corpus generation

Prompt: “Generate 100 valid but edge-case XML documents that test parser limits:

deeply nested elements, extremely long attribute values, unusual encoding”

Approach 2: Semantic mutation of existing corpus

Prompt: “Given this valid HTTP request, generate 20 variants that test edge cases:

unusual headers, encoding variations, boundary value inputs”

Approach 3: Targeted input generation from code analysis

Prompt: “This function has an integer overflow if [condition].

Generate 10 inputs that approach the overflow boundary”

# Research tools combining LLMs and fuzzing

ChatAFL: fuzzer using LLMs to mutate protocol state machine sequences

Fuzz4All: LLM-driven universal fuzzer for systems that take programming languages as input (compilers, interpreters, solvers)

LLM-Fuzzer: generating test cases for LLMs themselves (meta-application)

OSS-Fuzz AI (OSS-Fuzz-Gen): Google’s integration of LLM-generated fuzz harnesses into OSS-Fuzz
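
Approach 3 does not always need an LLM to produce the inputs: once code analysis (by you or the model) has identified an integer width or a `count * size` calculation, the boundary inputs can be generated deterministically. This is a minimal sketch under that assumption. Both function names are hypothetical, and the values deliberately include numbers just outside the representable range, which is useful when the target parses integers from text.

```python
def boundary_values(bits: int = 32, signed: bool = True, spread: int = 2) -> list[int]:
    """Return integers clustered around zero and the min/max of the given width."""
    if signed:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        low, high = 0, (1 << bits) - 1
    values = set()
    for pivot in (low, 0, high):
        for delta in range(-spread, spread + 1):
            values.add(pivot + delta)  # may fall outside [low, high] on purpose
    return sorted(values)


def multiplication_overflow_counts(
    element_size: int, bits: int = 32, spread: int = 1
) -> list[int]:
    """Return counts near the point where count * element_size wraps an unsigned integer."""
    threshold = (1 << bits) // element_size
    return [threshold + delta for delta in range(-spread, spread + 1)]


if __name__ == "__main__":
    print(boundary_values(bits=16, signed=True, spread=1))
    print(multiplication_overflow_counts(element_size=4))
```

These values make a good seed corpus to hand to a coverage-guided fuzzer, or to include in the LLM prompt as examples of the boundary you want it to explore further.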

Real Documented AI Discoveries

The credibility of AI vulnerability discovery isn’t theoretical — I point to documented cases in every briefing where someone questions the practical relevance. These are the cases I use.

DOCUMENTED AI VULNERABILITY DISCOVERIES

# Case 1: Google OSS-Fuzz AI discovers OpenSSL out-of-bounds memory access (2024)

System: Google’s OSS-Fuzz integrated with AI-generated fuzz targets

Target: OpenSSL, one of the most widely audited open-source libraries

Result: AI-generated fuzz harness discovered an out-of-bounds memory read/write

Impact: CVE assigned (CVE-2024-9143), patched by the OpenSSL project

# Case 2: Google Project Zero and DeepMind: Big Sleep discovers SQLite zero-day (2024)

System: Big Sleep (evolved from Project Naptime), an LLM-assisted vulnerability research agent

Target: SQLite, a mature, well-audited database library

Result: Exploitable stack buffer underflow, reported and fixed before it reached an official release

Context: Google described it as the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software

# Case 3: Academic LLM-based static analysis compared with traditional scanners

Several studies (2023-2025) report LLMs detecting more true vulnerabilities than

traditional SAST tools on C/C++ memory safety benchmarks; results vary by model, dataset, and prompt, and false positives remain a known weakness

# What these cases have in common

All targeted mature, well-audited codebases — not low-hanging fruit

All discovered vulnerabilities that existing fuzzing and human review had not caught

All required human verification and PoC development to confirm exploitability

EXERCISE 2 — BROWSER (15 MIN)

Research Google Big Sleep and OSS-Fuzz AI

Step 1: Research Big Sleep / Project Naptime
Search: “Google Big Sleep AI vulnerability discovery SQLite 2024”
Find the Google Security Blog post about the SQLite zero-day.
What was the vulnerability type?
What was the AI system’s methodology for finding it?

Step 2: Research Google OSS-Fuzz AI
Search: “Google OSS-Fuzz AI fuzzing 2024”
What is the AI component in OSS-Fuzz?
How many vulnerabilities has OSS-Fuzz found total, and how does AI improve the rate?

Step 3: Research academic LLM vulnerability research
Search: “LLM vulnerability detection academic paper 2024”
Find one paper comparing LLM-based code review to traditional static analysis.
Which outperforms? Under what conditions?

Step 4: Personal relevance
If you focus on bug bounty hunting, which AI vulnerability discovery technique
(code review or fuzzing) is more directly applicable to your targets?
What would your first AI-assisted audit look like?

Document: Big Sleep methodology + OSS-Fuzz AI component + your application plan.

✅ The Big Sleep methodology is worth understanding in detail because it represents the current state-of-the-art in LLM-assisted vulnerability research. The key insight from the Google post: the system uses a multi-step approach where the LLM first reads the code to understand intent, then reasons about edge cases, then generates targeted test cases for those edge cases. This is fundamentally different from random fuzzing — it’s hypothesis-driven testing, the same approach a skilled human researcher uses, now automated. My expectation: this approach will be the basis for the next generation of commercial vulnerability scanners within 2–3 years.

Limitations — Where AI Falls Short

The limitations of AI vulnerability discovery matter as much as the capabilities. Understanding them shapes how I use AI tools: specifically, which parts of the research pipeline I trust AI for and which require my own careful analysis. The primary limitations I work around are multi-file context, novel logic bug detection, false positive rates, and exploitability assessment.

AI VULNERABILITY DISCOVERY — CURRENT LIMITATIONS

# Limitation 1: Context window constraints

Problem: vulnerabilities spanning multiple files/functions require full context

Example: auth bypass requires understanding both auth.py and session.py together

Workaround: chunk code intelligently, maintain call graph context in prompts

# Limitation 2: Novel logic bugs

Problem: AI learns from existing vulnerability patterns → misses bugs with no prior examples

Example: unique business logic flaw in a custom protocol has no training precedent

Reality: this is where human expert time is irreplaceable

# Limitation 3: False positive rates

Problem: AI flags non-vulnerable code as vulnerable → wastes expert review time

Severity: varies by vulnerability class — SQLi FP rate lower than race condition FP rate

Workaround: calibrate expected FP rate per class on known-good/known-bad sample

# Limitation 4: Exploitability assessment

Problem: AI identifies potentially vulnerable code but can’t always determine if reachable/triggerable

Example: buffer overflow found in dead code path that’s never executed in production

Workaround: manual taint analysis for code paths to confirm exploitability
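
The workaround for Limitation 1 ("chunk code intelligently, maintain call graph context in prompts") can be sketched as a breadth-first walk over the call graph that stops adding functions once a size budget is reached. This is a minimal sketch, assuming you already have function bodies and a caller-to-callees mapping from some other tool; `select_functions`, `render_context`, and the character budget are hypothetical, and a real setup would count tokens rather than characters.

```python
from collections import deque


def select_functions(
    target: str,
    sources: dict[str, str],
    call_graph: dict[str, list[str]],
    budget_chars: int = 8000,
) -> list[str]:
    """Pick the target and its nearest callees whose bodies fit within the budget."""
    selected: list[str] = []
    used = 0
    seen = {target}
    queue = deque([target])
    while queue:
        name = queue.popleft()
        body = sources.get(name)
        if body is None:
            continue  # external or library function with no source available
        if used + len(body) > budget_chars:
            continue  # too large: skip it, and do not descend into its callees
        selected.append(name)
        used += len(body)  # budget counts function bodies only, not the headers
        for callee in call_graph.get(name, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return selected


def render_context(names: list[str], sources: dict[str, str]) -> str:
    """Join the selected functions into one prompt section with labelled headers."""
    return "\n\n".join(f"# --- {name} ---\n{sources[name]}" for name in names)


if __name__ == "__main__":
    src = {
        "login": "def login(): check_session()",
        "check_session": "def check_session(): load_token()",
    }
    graph = {"login": ["check_session"], "check_session": ["load_token"]}
    print(render_context(select_functions("login", src, graph), src))
```

Breadth-first order keeps the functions closest to the target in the prompt, which matches the auth.py plus session.py example: the direct callees matter most for spotting a bypass.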

EXERCISE 3 — THINK LIKE A RESEARCHER (10 MIN)

Design Your AI-Assisted Bug Bounty Research Workflow

Design a systematic AI-assisted vulnerability research workflow for bug bounty.

YOUR CONTEXT: You hunt on HackerOne and Bugcrowd.
Typical targets: web applications (PHP/Python/Node.js)
Specialisation: web application vulnerabilities (SQLi, XSS, IDOR, auth bypass)

DESIGN YOUR WORKFLOW:

1. RECON PHASE (AI assistance?)
Which recon tasks benefit from AI assistance?
(Target profiling, attack surface mapping, technology identification)

2. CODE REVIEW PHASE (for targets with open source code)
What is your LLM prompt sequence for web application code review?
Write your “first pass” prompt for a PHP web application.
What vulnerability classes do you trust the LLM on vs. verify manually?

3. BEHAVIOUR TESTING PHASE (for black-box targets)
Can AI help generate test cases for black-box testing?
What would an LLM-assisted fuzzing session look like for a form field?

4. REPORT WRITING PHASE
Can AI help write the vulnerability report?
What parts of the report do you write manually vs. draft with AI?

5. CALIBRATION
How do you track LLM false positive and false negative rates over time?
After 10 findings, what would you change about your workflow?

✅ The report writing question (point 4) has a practical answer that most researchers overlook. AI is excellent at drafting the technical description and remediation recommendation sections of bug bounty reports — these follow a standard structure and benefit from clear, well-organised prose. I write the proof-of-concept and business impact sections manually because these require accurate technical demonstration data and programme-specific business context that AI can’t know. The hybrid approach: AI drafts the structure and boilerplate, researcher fills in the evidence and impact.

AI Vulnerability Discovery — Key Points

Pipeline stages: scope selection (AI low) → triage (AI high) → pattern detection (AI medium-high) → exploitation and confirmation (AI low-medium)

LLM code review: good at injection, dangerous functions, missing auth; poor at logic bugs, race conditions

AI fuzzing: grammar-based corpus generation + semantic mutation complement coverage-guided fuzzing once its corpus saturates

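Documented: Google Big Sleep found a SQLite zero-day in 2024, described by Google as the first public example of an AI agent finding an exploitable memory-safety bug in widely used real-world software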

Limitations: context window, novel logic bugs, false positives, exploitability assessment

AI Vulnerability Discovery 2026

This article covered the pipeline, LLM-assisted code review techniques, AI-assisted fuzzing, real documented discoveries, and the limitations to work around. Next: AI-Powered Exploit Code Generation, which covers how AI takes discovered vulnerabilities and generates working proof-of-concept code.

Quick Check

A security researcher uses an LLM to review a 1,500-line PHP application. The LLM flags 12 potential SQL injection vulnerabilities. On manual review, 9 are false positives (the queries use prepared statements) and 3 are genuine. What does this outcome tell you about how to use LLMs in your vulnerability research workflow?

Frequently Asked Questions

Can AI really find zero-day vulnerabilities?

Yes, and there are documented cases. Google’s Big Sleep system found an exploitable stack buffer underflow in SQLite in 2024, which Google described as the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software. Google has also reported new vulnerabilities found through AI-generated fuzz targets in OSS-Fuzz, including one in OpenSSL. These are mature, well-audited codebases where existing fuzzing and human review had not caught the bugs.

What vulnerability types is AI best at finding?

LLMs are most reliable at finding injection vulnerabilities (SQL, command, XSS) where user input reaches dangerous functions without sanitisation, dangerous function usage (strcpy, gets, system()), hardcoded credentials and API keys, missing authentication checks, and insecure deserialization patterns. They are less reliable at race conditions, novel logic bugs, complex multi-function vulnerability chains, and integer overflow edge cases that require precise numeric reasoning.

How does AI-assisted fuzzing differ from traditional fuzzing?

Traditional fuzzing generates random or mutated inputs, relying on coverage feedback to explore code paths. AI-assisted fuzzing uses LLMs to generate semantically valid, targeted inputs — inputs that are valid for the parser (avoiding immediate rejection) but target specific edge cases inferred from code analysis. AI fuzzing complements traditional fuzzing: traditional coverage-guided fuzzing explores breadth efficiently, AI fuzzing generates targeted inputs for specific uncovered code paths or vulnerability hypotheses.

Is AI vulnerability discovery legal and ethical?

The ethics and legality are identical to traditional vulnerability research: testing your own systems, participating in authorised bug bounty programmes, or having written permission from the system owner is legal and ethical. Using AI tools doesn’t change the legal framework — the authorisation requirement applies to the activity (vulnerability testing), not the tools used. AI tools accelerate authorised security research and are used by Google, major security firms, and independent bug bounty hunters in authorised contexts.
