Exploratory Testing in Agile: Techniques That Find Real Bugs
Your automated test suite has 2,400 tests. All passing. Green across the board. And yet, five minutes after release, a customer reports that the checkout page shows a negative total when they apply two discount codes. No scripted test covered that combination. No one thought to try it.
This is the gap that exploratory testing fills.
Scripted tests verify that the software does what you expected. Exploratory testing discovers what you didn't expect — the interactions, edge cases, and user behaviors that no requirements document anticipated. According to research by Cem Kaner, exploratory testing finds 50-60% of defects in mature software products, often uncovering bugs that scripted testing systematically misses.
Yet many agile teams treat exploratory testing as "just clicking around." It's far more than that. Done well, exploratory testing is a disciplined, skill-intensive practice that combines test design, execution, and learning in real time.
Here's how to do it well.
What Exploratory Testing Actually Is
James Bach defines exploratory testing as "simultaneous learning, test design, and test execution." You're not following a script. You're actively investigating the software — forming hypotheses about how it might break, testing those hypotheses, and using the results to guide your next move.
Think of it like a journalist investigating a story versus a journalist reading a press release. Both produce output. But the investigator finds the real story.
Exploratory testing by the numbers
A study by Microsoft Research found that exploratory testing sessions discovered 25% more unique defects per hour than scripted test execution. The defects found were also rated higher in severity, because exploratory testers naturally gravitate toward risk areas.
What exploratory testing is not:
- It's not ad-hoc testing. Ad-hoc testing is unstructured, undocumented, and unrepeatable. Exploratory testing uses charters, time-boxes, and note-taking to maintain structure.
- It's not a replacement for automated testing. Automation handles regression. Exploratory testing handles discovery. They're complementary.
- It's not just for beginners. The best exploratory testers are senior QA engineers with deep domain knowledge and testing intuition built over years.
The Cognitive Science Behind Exploratory Testing
Why does exploratory testing find bugs that scripted testing misses? It comes down to how the human brain processes information differently from test scripts.
Scripted tests follow predetermined paths — they can only verify what someone already thought of. Exploratory testing leverages three cognitive advantages:
Pattern recognition. Experienced testers recognize subtle anomalies that scripts miss — a page loading 200ms slower than usual, a font weight that's slightly off, a validation message that doesn't match the pattern used elsewhere. These observations trigger deeper investigation that often uncovers real defects.
Associative thinking. When a tester notices that the discount code field accepts negative values, they immediately wonder: "What about the quantity field? The shipping cost override? The manual price adjustment?" Scripts test one thing at a time. Human testers make lateral connections.
Contextual judgment. A script can verify that the order confirmation page renders. A human tester notices that the confirmation email takes 45 seconds to arrive, that the order total differs by $0.01 from the cart total, and that the shipping address is formatted differently than the billing address. Context-aware judgment catches entire categories of bugs that scripted assertions miss.
Session-Based Test Management (SBTM)
SBTM, developed by Jon and James Bach, brings structure to exploratory testing without killing its creative nature. The core unit is the session — a focused, time-boxed period of exploratory testing with a clear purpose.
Anatomy of a Testing Session
Each session has three components:
Charter: A brief statement of what you're exploring and why. "Explore the user registration flow with unusual email formats to evaluate input validation robustness."
Time-box: Typically 60-90 minutes. Short enough to maintain focus, long enough to go deep.
Session notes: A running log of what you tested, what you found, what questions emerged, and what areas you'd explore next.
Session Sheet Template
A session sheet captures the essential information:

Charter:      Explore … with … to discover …
Tester(s):    Who ran the session (solo or paired)
Date/time-box: When the session ran, and its planned duration
Test notes:   What you tried and what you observed
Bugs:         Defects found, with reproduction steps
Issues:       Open questions, blockers, and risks
Next:         Candidate areas for a follow-up session
After each session, you debrief — ideally with another tester or your team lead. The debrief covers: What did you learn? What risks remain? What should the next session focus on?
Metrics from SBTM
SBTM gives you measurable data about exploratory testing:
- Session count per sprint — How much exploration are you doing?
- Bug discovery rate — Bugs found per session hour
- Coverage distribution — Which areas of the product have been explored recently, and which haven't?
- Charter completion — Did you cover what you intended, or did you rabbit-hole on something unexpected?
- Session-to-bug ratio — The percentage of sessions that discover at least one bug. A healthy rate is 60-80%.
These metrics transform exploratory testing from "we clicked around for a while" into a trackable, reportable activity.
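As a sketch, these metrics can be computed directly from raw session logs. The `Session` shape and field names below are illustrative, not taken from any particular tool:

```python
from dataclasses import dataclass

@dataclass
class Session:
    minutes: int      # length of the time-box actually used
    bugs_found: int   # unique bugs discovered in the session

def sbtm_metrics(sessions):
    """Summarize a sprint's sessions into the SBTM metrics listed above."""
    hours = sum(s.minutes for s in sessions) / 60
    bugs = sum(s.bugs_found for s in sessions)
    with_bugs = sum(1 for s in sessions if s.bugs_found > 0)
    return {
        "session_count": len(sessions),
        "bugs_per_hour": bugs / hours if hours else 0.0,
        "session_to_bug_ratio": with_bugs / len(sessions) if sessions else 0.0,
    }

# The four Sprint 14 sessions described in the next section:
metrics = sbtm_metrics([Session(90, 4), Session(60, 2), Session(90, 3), Session(60, 1)])
# metrics["bugs_per_hour"] == 2.0, metrics["session_to_bug_ratio"] == 1.0
```

Tracking these numbers per sprint takes minutes, and the trend line is what stakeholders actually want to see.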
SBTM in Practice: A Sprint-Level Example
Here's how a team might structure exploratory testing across a two-week sprint:
Sprint 14 Exploratory Testing Plan
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
New features this sprint:
- Bulk user import (CSV upload)
- Notification preferences redesign
- API rate limiting
Session 1 (Day 3, 90 min): Sarah
Charter: Explore bulk CSV import with malformed files, large datasets
(100K rows), and special characters to discover data handling
edge cases.
Result: 4 bugs found (encoding issue with UTF-8 BOM, timeout on
50K+ rows, silent skip of duplicate emails, misleading
progress bar)
Session 2 (Day 5, 60 min): David
Charter: Explore notification preferences with rapid toggling,
browser back/forward, and network interruption to discover
state management issues.
Result: 2 bugs found (unsaved changes lost on browser back,
preference toggle doesn't disable if API call fails)
Session 3 (Day 7, 90 min): Sarah + David (paired)
Charter: Explore API rate limiting under realistic load patterns
to verify limits are enforced correctly and error
responses are clear.
Result: 3 bugs found (rate limit resets at wrong interval,
429 response missing Retry-After header, rate limit
applies to health check endpoint)
Session 4 (Day 9, 60 min): David
Charter: Explore the interaction between bulk import and
notification preferences — do imported users get
default notification settings?
Result: 1 bug found (imported users get no notification
preferences set, causing null pointer on notification
service)
Sprint total: 4 sessions, 5 hours, 10 bugs found
Discovery rate: 2.0 bugs/hour
Coverage: 3 new features explored, 2 cross-feature interactions tested
Writing Effective Test Charters
The charter is the single most important element of a successful exploratory session. A weak charter leads to aimless wandering. A strong charter focuses your energy on high-value areas.
Charter Formula
Use this template: Explore [target] with [resources] to discover [information].
Good charters:
- "Explore the password reset flow with expired tokens and manipulated URLs to discover security vulnerabilities."
- "Explore the report generation feature with datasets exceeding 100,000 rows to discover performance and usability limits."
- "Explore the mobile checkout flow on slow 3G connections to discover timeout handling and UX degradation."
- "Explore the file upload feature with files at exactly the size limit, slightly over, and unsupported formats to discover boundary handling."
- "Explore the multi-tenant admin panel switching between tenants rapidly to discover data isolation failures."
Weak charters:
- "Test the login page." (Too vague — what aspect? What are you looking for?)
- "Find bugs in the admin panel." (No focus area, no risk hypothesis.)
- "Verify everything works." (Not a charter — it's a wish.)
Charter generation shortcut
After sprint planning, scan the new stories and ask: "What's the riskiest thing about this story?" Your answer becomes a charter. If the story involves payment processing, your charter might be: "Explore payment processing with boundary amounts ($0.01, $9999.99, $0.00) and currency edge cases to discover calculation errors."
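Because the formula above is so mechanical, some teams template it. A hypothetical helper (the function name is illustrative) for seeding a sprint's charter list:

```python
def charter(target, resources, information):
    """Render the charter formula: Explore [target] with [resources] to discover [information]."""
    return f"Explore {target} with {resources} to discover {information}."

# Reproduces the first "good charter" example above:
charter("the password reset flow",
        "expired tokens and manipulated URLs",
        "security vulnerabilities")
```

Forcing every session through this shape is a cheap way to catch "test the login page"-style charters before the session starts.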
Charter Sources: Where to Find Exploration Targets
Not sure what to explore? These sources consistently yield high-value charters:
Recent code changes. Pull the git log for the current sprint. Every changed file is a potential exploration target. Refactored modules are especially rich targets — the behavior should be identical, but refactoring can introduce subtle regressions.
# Find recently changed files to target exploration
git log --since="2 weeks ago" --name-only --pretty=format: | grep . | sort | uniq -c | sort -rn | head -20
Customer support tickets. The areas where customers report confusion, unexpected behavior, or workarounds are areas where the software doesn't match user expectations. These are prime exploration targets.
Error logs and monitoring. If your application logs show a spike in 500 errors on a specific endpoint, that's a signal that the area needs manual investigation beyond what monitoring catches.
Code complexity metrics. Files with high cyclomatic complexity are statistically more likely to contain bugs. Target exploratory sessions at the most complex modules in your codebase.
Feature intersections. Features tested in isolation work fine. Features tested in combination break. "What happens when a user applies a discount code AND uses a gift card AND has a loyalty points balance?" Nobody scripts that test. Exploratory testing finds it.
Testing Heuristics: Structured Thinking Tools
Heuristics are mental models that help you generate test ideas systematically. They prevent the common trap of testing what's obvious and missing what's subtle.
SFDIPOT (San Francisco Depot)
Created by James Bach, this mnemonic helps you think about different quality dimensions:
- S — Structure: What is the product made of? Database schemas, API endpoints, file formats, UI components.
- F — Function: What does it do? Core features, calculations, workflows.
- D — Data: What data does it process? Inputs, outputs, stored data. What happens with empty data, huge data, special characters?
- I — Interfaces: How does it connect to other things? APIs, third-party services, hardware, other applications.
- P — Platform: Where does it run? Operating systems, browsers, devices, network conditions.
- O — Operations: How will it be used in practice? Peak loads, maintenance windows, error recovery, backups.
- T — Time: How does time affect it? Timeouts, expiration, scheduling, time zones, daylight saving transitions.
Applying SFDIPOT to a Real Feature
Let's apply SFDIPOT to a "user profile photo upload" feature:
| Dimension | Exploration Ideas |
|-----------|-------------------|
| Structure | What file formats are stored? What's the database schema? Is there a CDN? |
| Function | Upload, crop, resize, delete, change. What about the default avatar? |
| Data | 0KB file, 50MB file, corrupted JPEG, SVG with embedded script, animated GIF, HEIC from iPhone |
| Interfaces | CDN integration, image processing service, mobile app upload, profile API |
| Platform | Safari (HEIC native), Chrome, mobile browsers, slow connection, offline mode |
| Operations | What happens during CDN outage? Storage limits? Concurrent uploads? |
| Time | Cache expiration, signed URL expiration, how long does processing take? |
From one feature, SFDIPOT generates 30+ exploration ideas. Pick the riskiest 5-8 for your session charter.
FEW HICCUPPS
Developed by Michael Bolton, this heuristic helps you identify oracles — ways to judge whether software behavior is correct:
- F — Familiar problems: Does it exhibit bugs you've seen before in similar software?
- E — Explainability: Can you explain the behavior to a user without them being confused?
- W — World: Does it match how the real world works?
- H — History: Does it behave consistently with previous versions?
- I — Image: Does it match the company's brand and quality standards?
- C — Comparable products: Does it work like similar products users are familiar with?
- C — Claims: Does it match what the documentation, marketing, or requirements say?
- U — User expectations: Would a real user find this behavior surprising?
- P — Purpose: Does it serve the purpose it was built for?
- P — Product: Is it internally consistent? Does feature A work the same way as feature B?
- S — Standards: Does it comply with relevant standards (accessibility, security, regulatory)?
CRUD Heuristic
For any data entity in the system, systematically test:
- Create — Can you create it? With minimum data? Maximum data? Duplicate data?
- Read — Can you view it? All fields? Permissions? What about deleted entities?
- Update — Can you change every field? What about partial updates? Concurrent updates?
- Delete — Can you delete it? What about dependent data? Soft delete vs. hard delete?
This simple heuristic catches a surprising number of bugs because developers often implement Create and Read carefully but rush Update and Delete.
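The heuristic is mechanical enough to expand automatically. A hypothetical sketch that turns an entity name into CRUD exploration prompts (wording is illustrative; tailor it to your domain):

```python
def crud_checklist(entity):
    """Expand the CRUD heuristic above into exploration prompts for one data entity."""
    return [
        f"Create: create {entity} records with minimum, maximum, and duplicate data",
        f"Read: view {entity} records with limited permissions, and after deletion",
        f"Update: change every {entity} field; try partial and concurrent updates",
        f"Delete: delete {entity} records with dependent data; compare soft vs. hard delete",
    ]

# e.g. crud_checklist("invoice") yields four prompts, one per CRUD operation
```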
Boundary Value Heuristic
For every input field or parameter with constraints, test:
- Just below the minimum
- At the minimum
- Just above the minimum
- A typical value
- Just below the maximum
- At the maximum
- Just above the maximum
- Zero
- Empty/null
- Negative (if applicable)
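This checklist can be generated rather than written by hand. A minimal sketch for a numeric field constrained to a [minimum, maximum] range; treat `None` as the empty/null case, and skip any values the field type disallows:

```python
def boundary_values(minimum, maximum, step=1):
    """Candidate inputs for a numeric field constrained to [minimum, maximum],
    following the boundary-value checklist above."""
    candidates = [
        minimum - step,            # just below the minimum (negative when minimum <= 0)
        minimum,                   # at the minimum
        minimum + step,            # just above the minimum
        (minimum + maximum) // 2,  # a typical value
        maximum - step,            # just below the maximum
        maximum,                   # at the maximum
        maximum + step,            # just above the maximum
        0,                         # zero
        None,                      # empty/null
    ]
    # Preserve order while dropping duplicates (e.g. when minimum - step == 0).
    seen, unique = set(), []
    for value in candidates:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique

# A quantity field limited to 1..99:
# boundary_values(1, 99) returns [0, 1, 2, 50, 98, 99, 100, None]
```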
You don't need to apply every heuristic to every session. Pick 2-3 that are relevant to your charter and use them as idea generators.
Note-Taking Strategies
The biggest mistake in exploratory testing is not taking notes. If you can't recall what you tested, what you found, and what you skipped — the session's value evaporates.
What to Record
- Steps you took — Not every click, but enough to reproduce your path
- Observations — Things that seemed odd, even if they're not bugs
- Questions — "Why does this take 8 seconds? Is that expected?"
- Bugs — With screenshots, console logs, and reproduction steps
- Areas skipped — What you intended to cover but didn't get to
- Ideas for next session — Where you'd go next if you had more time
A Real Session Note Example
Here's what effective session notes look like in practice:
Session: Explore bulk CSV import with edge cases
Time: 10:15 AM - 11:45 AM (90 min)
Charter: Test bulk user import with malformed files, large datasets,
and special characters
10:15 — Starting with a clean import. 10-row CSV with valid data.
Import succeeds in 2 seconds. All 10 users created. Good baseline.
10:22 — Trying 1,000 rows. Import takes 8 seconds. Progress bar shows
percentage. All rows imported. Checking database...
NOTE: No duplicate email check during import. If CSV has
duplicate emails, both rows are imported. Bug? Need to verify
expected behavior.
10:31 — Trying 50,000 rows. Progress bar starts, reaches 60%, then...
browser tab crashes. Refreshing shows import is still running
server-side. No way to monitor progress after browser crash.
BUG: No import status endpoint to check progress independently.
10:42 — After 50K import completes (checked via API), user list page
loads slowly — 12 seconds. Pagination works but filter
dropdowns time out on "department" field (now has 50K options).
BUG: Department filter should use type-ahead, not dropdown,
for large datasets.
10:55 — Testing CSV with UTF-8 BOM marker (common from Excel exports).
Import fails with "Invalid file format." Removing BOM marker
fixes it.
BUG: Should handle UTF-8 BOM transparently.
11:05 — CSV with special characters in names: María García, O'Brien,
François. Names import correctly! Good.
11:15 — CSV with empty rows, extra columns, and inconsistent quoting.
Import skips empty rows (good) but silently drops extra columns
without warning (confusing — user might expect an error).
QUESTION: Should extra columns generate a warning?
11:30 — CSV with 0 rows (header only). Import shows "Success: 0 users
imported." Acceptable but could be more helpful —
"No data rows found" would be clearer.
11:40 — Wrapping up. Bugs: 3 confirmed, 1 probable, 2 questions for PO.
Next session: test import with concurrent imports from two users,
and test the rollback behavior if import fails mid-way.
Lightweight Note-Taking Methods
Stream-of-consciousness log: Type as you go. "Tried negative quantity in cart — got -$29.99 total. Bug? Checking if server validates... server accepted the order. Definitely a bug."
Screenshot journal: Take a screenshot every time something interesting happens. Annotate with a one-line note.
Screen recording with voice narration: Record your session and narrate your thinking. Great for debriefs but time-consuming to review.
Structured template: Use a pre-built template with sections for setup, actions, observations, and bugs.
The 5-minute rule
Set a timer for every 5 minutes during your session. When it goes off, jot down one sentence about what you're currently doing and why. This creates a breadcrumb trail that makes post-session reporting painless.
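A sketch of the idea in Python. The `breadcrumb` helper matches the timestamped note style shown earlier; the file path, prompt wording, and function names are illustrative:

```python
import time
from datetime import datetime

def breadcrumb(note, now=None):
    """Format one timestamped breadcrumb line in the session-note style above."""
    stamp = (now or datetime.now()).strftime("%H:%M")
    return f"{stamp} — {note}"

def run_breadcrumb_timer(path="session-notes.txt", interval_minutes=5):
    """Every interval, prompt for a one-sentence note and append it to the log.
    An empty answer ends the loop; earlier notes survive on disk regardless."""
    while True:
        time.sleep(interval_minutes * 60)
        note = input("What are you doing right now, and why? ")
        if not note:
            break
        with open(path, "a", encoding="utf-8") as f:
            f.write(breadcrumb(note) + "\n")
```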
Time-Boxing and Pairing
Why Time-Boxing Matters
Without time constraints, exploratory testing either expands to fill all available time or gets cut short when something "more urgent" comes up. Time-boxing — committing to a specific duration — solves both problems.
Recommended time-boxes:
- Short session (25-30 minutes): Quick exploration of a specific feature or bug-fix area. Good for verifying that a fix doesn't introduce regressions in adjacent functionality.
- Standard session (60-90 minutes): Deep dive into a feature area with a focused charter. This is the workhorse of exploratory testing — long enough to build understanding and go deep, short enough to maintain focus.
- Extended session (2 hours): Cross-feature exploration or end-to-end workflow investigation. Reserve these for critical features before major releases.
The time-box should be treated seriously. When the timer ends, stop testing and start documenting. The constraint forces prioritization — you naturally focus on the riskiest areas because you know time is limited.
Paired Exploratory Testing
Two testers exploring together find more bugs than two testers exploring separately. The "driver/navigator" model works well — one person operates the software while the other suggests test ideas, takes notes, and watches for anomalies.
Research from the Agile Testing Fellowship found that paired exploratory sessions produce 30-40% more unique defects than two solo sessions of the same duration. The synergy comes from real-time discussion — the navigator spots things the driver misses, and the collaborative brainstorming generates test ideas neither would have thought of alone.
Pairing is especially valuable when:
- A junior tester pairs with a senior tester (knowledge transfer)
- A tester pairs with a developer (different perspectives)
- A tester pairs with a product owner (business context)
- Two testers from different teams explore a shared feature (cross-domain perspective)
Developer-tester pairing is particularly powerful. The developer knows the code's weak points — "I wasn't sure about the null handling in the address parser" — and can guide the tester toward fragile areas. The tester brings user perspective and testing instincts that developers often lack.
Mob Testing
An extension of pairing, mob testing involves 3-5 people exploring software together. One person drives while the others suggest, observe, and discuss. The driver rotates every 10-15 minutes.
Mob testing is useful for:
- Onboarding new team members (they learn the product and testing approach simultaneously)
- Exploring high-risk features before major releases (multiple perspectives catch more)
- Building shared understanding of complex features across team members
Run mob testing sparingly — it's high-value but expensive in people-hours. A monthly 90-minute mob testing session focused on the sprint's riskiest feature is a good starting cadence.
Combining Exploratory Testing with Automation
Exploratory and automated testing aren't competitors — they're partners.
Exploration feeds automation. When an exploratory session discovers a critical bug, the fix gets automated regression coverage. The exploratory tester identified the risk; the automation engineer locks it down. This is the ideal lifecycle of a test: discovered through exploration, codified through automation.
Automation frees up exploration time. Every regression test you automate is time you don't spend re-checking known functionality — time you can reinvest in discovering unknown problems.
Use automation results to guide exploration. If your automated suite shows that the payment module has the highest failure rate, that's where your next exploratory session should focus. Flaky tests are also exploration signals — a flaky test often indicates an area with race conditions, timing issues, or state management problems that deserve manual investigation.
Explore around automated boundaries. Automated tests typically cover the specified requirements. The areas just outside those boundaries — feature interactions, unusual sequences, edge cases beyond the spec — are where exploratory testing adds the most unique value.
A healthy ratio for most agile teams: 70% automated regression, 30% exploratory testing time. The exact split depends on product maturity — newer products benefit from more exploration, while stable products lean heavier on automation.
The Exploration-Automation Cycle
Here's how the cycle works in a mature agile team:
Sprint N:
1. New feature developed
2. Exploratory session discovers 4 bugs
3. Developers fix bugs
4. Automation engineer writes regression tests for the 4 fixed bugs
5. Exploratory tester moves on to next risk area
Sprint N+1:
1. Automated tests from Sprint N catch a regression (one of the 4 fixes broke)
2. Meanwhile, exploratory testing focuses on the NEW sprint's features
3. No human time wasted re-testing Sprint N's fixes — automation handles it
Result: Each sprint, the automated safety net grows, and exploratory
testing continually pushes into new territory.
Measuring Exploratory Testing Effectiveness
"How do I know if my exploratory testing is effective?" Measure outcomes, not activity:
- Bugs found in exploratory sessions vs. production — Are you catching bugs before customers do? A declining production defect rate alongside consistent exploratory bug discovery is the strongest signal of effectiveness.
- Severity distribution — Are you finding high-severity bugs, or just cosmetic issues? Effective exploratory testing should find proportionally more high and critical bugs than scripted testing, because exploratory testers target risk areas.
- Coverage breadth — Have you explored all major feature areas in the last month? Track which modules have been explored recently to prevent blind spots.
- Bug cluster analysis — Do exploratory sessions consistently find bugs in specific areas? That signals where automation or development practices need improvement.
- Time-to-discovery — How quickly after code deployment does exploratory testing find bugs? The faster bugs are found, the cheaper they are to fix.
Benchmarks for Healthy Exploratory Testing
Based on industry data and team case studies, here are benchmarks to evaluate your exploratory testing practice:
| Metric | Below Average | Average | Strong |
|--------|--------------|---------|--------|
| Bugs per session hour | < 1 | 1-2 | 3-5 |
| % sessions finding at least 1 bug | < 40% | 50-65% | 70-85% |
| % of total bugs found via exploration | < 15% | 20-35% | 40-60% |
| High/critical severity ratio | < 20% | 30-40% | 50%+ |
| Exploration coverage per sprint | < 30% of changed areas | 50-70% | 80%+ |
If your metrics consistently fall in the "below average" range, common causes include: weak charters (too vague), insufficient domain knowledge (testers exploring unfamiliar features), lack of debriefs (no learning feedback loop), or inadequate time allocation (exploratory testing getting squeezed out of sprints).
Common Mistakes
No charter, no structure. Clicking around randomly is not exploratory testing. Without a charter and time-box, you'll cover familiar ground and skip risky areas.
Not taking notes. If you can't show what you explored and what you found, stakeholders will question the value of the time you spent. Notes make exploratory testing visible and credible.
Only exploring new features. Existing features accumulate bugs too — especially after refactoring, dependency updates, or infrastructure changes. Dedicate some sessions to areas that haven't been touched in a while.
Stopping at the first bug. When you find a bug, the instinct is to stop and report it. Resist that instinct. Note the bug, keep exploring. The area around a bug is often where more bugs hide. Bugs cluster — a bug in the discount calculation likely means there are bugs in related calculations too.
Treating exploratory testing as optional. When sprints get tight, exploratory testing is the first thing cut. That's exactly when you need it most — rushed code has more defects. Protect exploratory testing time the same way you protect sprint planning and retrospective time.
Using the same tester for the same area every sprint. Familiarity is an asset, but it's also a blindness risk. Testers who explore the same feature repeatedly develop blind spots — they unconsciously follow the same paths. Rotate testers across areas periodically to bring fresh eyes.
Skipping the debrief. A session without a debrief loses half its value. The debrief is where you synthesize findings, identify patterns across sessions, and plan follow-up work. Even a 10-minute debrief with one colleague dramatically improves the value of each session.
Exploring without context. Don't start an exploratory session cold. Spend 5-10 minutes before the session reviewing: the user story or feature spec, recent code changes, known issues in the area, and customer feedback. This context makes your exploration dramatically more targeted.
How TestKase Supports Exploratory Testing
TestKase gives exploratory testers a place to plan sessions, capture findings, and connect discoveries back to the broader test strategy.
You can create test charters as lightweight test cases, log session results with screenshots and notes, and link any bugs found to the features they affect. When an exploratory session reveals a gap in your scripted coverage, you can generate new test cases directly from your session notes — with AI assistance to fill in the steps and expected results.
For example, if your session notes say "Discovered that bulk import fails with UTF-8 BOM files," you can ask TestKase's AI to generate a complete test case from that one-line finding. The AI produces structured steps (export from Excel with BOM, attempt import, verify error handling) and expected results — saving the tester from spending time on documentation.
Over time, TestKase builds a map of your exploratory coverage, showing which areas have been recently explored and which might be overdue for investigation. That visibility turns ad-hoc exploration into a strategic testing activity. Managers can see at a glance: "We haven't explored the payment module in 3 sprints — that should be next sprint's charter."
The platform also supports collaborative exploration through shared session logs and team dashboards, making it easy for distributed QA teams to coordinate exploratory effort without duplicating coverage.
Conclusion
Exploratory testing is the thinking tester's craft. It requires curiosity, domain knowledge, structured techniques, and the discipline to document what you find. It's not a substitute for automation — it's the complement that catches what automation can't.
The techniques in this guide — session-based management, well-crafted charters, heuristic-driven exploration, disciplined note-taking, and paired testing — transform exploratory testing from "clicking around" into a measurable, repeatable, high-value practice. Teams that apply these techniques consistently report finding 40-60% of their total defects through exploratory sessions, with a higher proportion of critical bugs than scripted testing produces.
Start with one focused session next sprint. Write a charter, set a 60-minute time-box, take notes, and debrief with a colleague. You'll find bugs your automated suite never would.
The best testers don't just verify requirements. They question them.