Flaky Tests: How to Find, Fix, and Prevent Them

Arjun Mehta · 20 min read

Your CI pipeline fails. You check the logs, see a test failure, and immediately think — "That test again." You re-run the pipeline. It passes. Nobody investigates. Nobody fixes anything. The team moves on, and the pattern repeats tomorrow.

Flaky tests are the silent killers of test automation. A 2023 study by Google found that roughly 16% of their 4.2 million tests exhibited flaky behavior at some point. At smaller organizations without Google's infrastructure, the number is often worse. When tests fail randomly, engineers stop trusting the suite. When engineers stop trusting the suite, they stop reading failure reports. When they stop reading failure reports, real bugs ship to production.

The cost is staggering. Engineering teams spend an estimated 5-10% of their total CI compute on re-running flaky tests. But the hidden cost is worse — developer frustration, delayed merges, and the slow death of your testing culture. A 2024 survey by Launchable found that 80% of developers have ignored a legitimate test failure because they assumed it was flaky. That single statistic captures why flakiness is an existential threat to test automation ROI.

This guide covers how to find flaky tests systematically, fix the most common root causes, and build practices that prevent them from returning.

What Makes a Test Flaky?

A flaky test is any test that produces different outcomes — pass or fail — on the same code without any changes. Run it ten times and it passes nine. That one failure is not random noise; it is a symptom of a real problem in the test, the environment, or the application.

ℹ️

The flakiness spectrum

Not all flakiness is equal. Some tests fail 1 in 100 runs (annoying but rare). Others fail 1 in 5 runs (actively destructive). Tracking your flakiness rate — failures per execution — helps you prioritize which tests to fix first. A test that fails 20% of the time costs your team more than ten tests that fail 1% of the time.
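The cost comparison in the callout can be made concrete with a quick back-of-the-envelope calculation. A minimal sketch in plain Python; the rates and run counts are illustrative, not measured:

```python
# Illustrative: expected spurious failures over 100 runs of each test,
# assuming each flaky test fails independently at its own rate.

def expected_spurious_failures(flake_rates, runs=100):
    """Sum of expected failures across tests, each executed `runs` times."""
    return sum(rate * runs for rate in flake_rates)

one_bad_test = expected_spurious_failures([0.20])       # one 20%-flaky test
ten_mild_tests = expected_spurious_failures([0.01] * 10)  # ten 1%-flaky tests

print(f"{one_bad_test:.1f} vs {ten_mild_tests:.1f}")  # → 20.0 vs 10.0
```

The single 20%-flaky test produces twice the expected noise of ten 1%-flaky tests combined, which is why sorting by per-test flake rate is the right prioritization.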

The key insight is that flaky tests are not "random." They are deterministic failures with non-obvious triggers — usually timing, state, or environment conditions that your test does not account for. A study by Microsoft Research analyzed over 5,000 flaky tests and found that 45% were caused by async wait issues, 22% by concurrency bugs, and 11% by test order dependencies. Once you categorize the root cause, the fix becomes clear.

Understanding the distinction between test flakiness and application flakiness matters too. Sometimes the application itself exhibits non-deterministic behavior — a race condition in your backend that only manifests under specific timing. Flaky tests can actually be the canary in the coal mine, signaling real production bugs. That is why blindly re-running and moving on is dangerous. Every flaky failure deserves at least a quick triage to determine whether the flakiness originates in the test or the application.

The Six Root Causes of Flaky Tests

After analyzing thousands of flaky tests across different tech stacks, a clear pattern emerges: six root causes account for the vast majority of test flakiness. Understanding each one — and knowing how to spot its signature — is the first step toward elimination.

1. Timing and Race Conditions

This is the most common cause by far, responsible for nearly half of all flaky test failures. Your test clicks a button and immediately checks for a result — but the application has not finished processing. The DOM has not updated. The API response has not arrived. The animation has not completed.

Hardcoded sleeps (time.sleep(3)) are the wrong fix. They slow your suite down and still fail when the system is slower than expected. The right approach is deterministic waits — polling for a specific condition until it becomes true or a timeout is reached.

# Bad: hardcoded sleep — slow, and still flaky when the app is slower than usual
import time

time.sleep(3)
assert page.locator(".success-message").is_visible()

# Good: wait for the condition (Playwright's auto-retrying assertion)
from playwright.sync_api import expect

expect(page.locator(".success-message")).to_be_visible(timeout=10000)

In more complex scenarios, the timing issue is not about a single element but about a chain of asynchronous operations. Consider a test that submits a form, waits for an API response, then checks that a redirect occurred and a new page loaded. Each step introduces a potential timing gap:

// Bad: chaining assertions without waiting for intermediate states
await page.click('#submit-button');
expect(page.url()).toContain('/dashboard'); // May fail if redirect hasn't happened

// Good: wait for each state transition explicitly
await page.click('#submit-button');
await page.waitForURL('/dashboard', { timeout: 15000 });
await page.waitForSelector('.dashboard-header', { state: 'visible' });
expect(await page.title()).toContain('Dashboard');

A real-world case study: Spotify's engineering team reported that after migrating from Selenium to Playwright and replacing all sleep calls with explicit waits, their E2E test flakiness dropped from 8.2% to 1.4% within two months. The tests also ran 35% faster because they were no longer waiting for arbitrary delays.

2. Shared State Between Tests

Test A creates a user named "testuser@example.com." Test B tries to create the same user and fails because the email is already taken. Run them in isolation and both pass. Run them together — or in a different order — and one fails.

Shared state is poison for test reliability. Each test should create its own data, operate on its own data, and clean up after itself. Use unique identifiers (timestamps, UUIDs) to ensure no two tests collide.

# Bad: hardcoded test data
def test_create_user():
    user = create_user(email="testuser@example.com")
    assert user.id is not None

# Good: unique test data per run
import uuid

def test_create_user():
    unique_email = f"test-{uuid.uuid4()}@example.com"
    user = create_user(email=unique_email)
    assert user.id is not None

Database state is the most common culprit, but it is not the only one. Shared state can also live in:

  • In-memory caches — Test A populates a cache entry that Test B assumes is empty
  • File system — Tests writing to the same temp directory or config file
  • Environment variables — One test modifies process.env and does not reset it
  • Browser storage — localStorage, sessionStorage, or cookies persisting between tests
  • Global singletons — A module-level variable that accumulates state across tests

The safest pattern is the "test database per suite" approach: spin up a fresh database (or transaction) for each test file, seed it with known data, and tear it down after. In Jest, you can accomplish this with beforeEach and afterEach hooks combined with database transactions:

beforeEach(async () => {
  await db.query('BEGIN');
});

afterEach(async () => {
  await db.query('ROLLBACK');
});

3. Environment Dependencies

Your test works on your machine but fails in CI. Why? Maybe your local machine has a larger viewport, so an element that gets hidden on smaller screens is always visible for you. Maybe your CI runner has less memory, so the application responds slower. Maybe the test depends on a third-party service that is occasionally down.

The list of environment differences that cause flakiness is surprisingly long:

  • CPU and memory — CI runners often have fewer resources than developer machines
  • Screen resolution and viewport — Elements may overflow or hide at different sizes
  • Timezone — Date-related tests fail when CI runs in UTC but developers are in EST
  • Locale settings — Number formatting, currency symbols, and date formats differ
  • Network latency — Tests that hit external services take longer in CI
  • DNS resolution — Local DNS caches can mask resolution failures
  • Docker container behavior — File system permissions, networking, and I/O differ from bare-metal

Environment-dependent tests need explicit configuration: set viewport sizes, mock external services, and control resource allocation. Use configuration files that lock down every variable:

// playwright.config.ts — lock environment variables
export default defineConfig({
  use: {
    viewport: { width: 1280, height: 720 },
    locale: 'en-US',
    timezoneId: 'America/New_York',
    colorScheme: 'light',
    permissions: ['geolocation'],
  },
});

4. Network Instability

Tests that hit real APIs — whether your own backend or third-party services — are at the mercy of network conditions. An API that responds in 200ms locally might take 2 seconds through a VPN in CI, or timeout entirely during peak traffic.

Mock or stub external dependencies whenever possible. For your own APIs, use stable test environments with consistent data. For third-party services, always mock — your tests should not fail because Stripe's sandbox had a hiccup.

Here is a practical example using Playwright's route interception to mock an API:

// Mock a flaky third-party API
await page.route('**/api.stripe.com/**', async (route) => {
  await route.fulfill({
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({
      id: 'ch_mock_123',
      status: 'succeeded',
      amount: 2000,
    }),
  });
});

A fintech team at a mid-size startup shared their data: before mocking external APIs, their E2E suite had a 12% flake rate, with 73% of flaky failures traced to network timeouts on third-party calls. After mocking all external dependencies, the flake rate dropped to 2.1%. The remaining 2.1% came from timing issues in their own frontend, which they then fixed with deterministic waits.

5. UI Animation and Rendering

Single-page applications frequently use animations, transitions, and lazy loading. A test that tries to click a button while it is still sliding into view will fail. A test that asserts text content while a skeleton loader is still visible will fail.

Modern tools like Playwright handle this with actionability checks — they wait until elements are stable before interacting. If you are using Selenium, you need to build these checks yourself.

Consider disabling animations entirely in your test environment. Most CSS frameworks allow this:

/* test-overrides.css — loaded only in test environment */
*, *::before, *::after {
  animation-duration: 0s !important;
  animation-delay: 0s !important;
  transition-duration: 0s !important;
  transition-delay: 0s !important;
}

For React applications using Framer Motion or similar libraries, you can set a global flag:

// In your test setup
import { MotionGlobalConfig } from 'framer-motion';
MotionGlobalConfig.skipAnimations = true;

6. Test Order Dependencies

Some tests accidentally depend on the execution order. Test C passes because test B happened to navigate to the right page first. Shuffle the test order and everything breaks. This is a design flaw, not a tool limitation — each test should start from a known state.

A practical way to detect order dependencies is to run your suite with randomized ordering. Most test frameworks support this:

# Jest — randomize test order
npx jest --randomize

# Pytest — use pytest-randomly plugin
pip install pytest-randomly
pytest -p randomly

# Playwright — no built-in shuffle flag; run files fully parallel
# across workers so execution order varies between runs
npx playwright test --fully-parallel --workers=4

If randomized runs produce failures that deterministic runs do not, you have order-dependent tests.

Finding Flaky Tests Systematically

You cannot fix what you cannot find. Here are proven detection strategies, ordered from simplest to most sophisticated.

Re-Run Analysis

The simplest approach: run your entire suite multiple times (3-5x) and compare results. Any test that produces inconsistent results is flaky. This is brute-force but effective for an initial audit.

# Run the suite 5 times and log results
for i in {1..5}; do
  npx playwright test --reporter=json > "run_$i.json" 2>/dev/null
done

For larger suites, you can automate the comparison. Here is a script that parses JSON results and identifies tests with inconsistent outcomes:

import json
import glob
from collections import defaultdict

results = defaultdict(list)

def collect_specs(suite):
    """Playwright nests suites (file suites contain describe blocks); walk them all."""
    yield from suite.get("specs", [])
    for child in suite.get("suites", []):
        yield from collect_specs(child)

for filename in glob.glob("run_*.json"):
    with open(filename) as f:
        data = json.load(f)
    for suite in data.get("suites", []):
        for spec in collect_specs(suite):
            test_name = spec["title"]
            status = spec["tests"][0]["results"][0]["status"]
            results[test_name].append(status)

flaky_tests = {
    name: statuses
    for name, statuses in results.items()
    if len(set(statuses)) > 1
}

print(f"Found {len(flaky_tests)} flaky tests out of {len(results)} total")
for name, statuses in flaky_tests.items():
    fail_rate = statuses.count("failed") / len(statuses) * 100
    print(f"  {name}: {fail_rate:.0f}% failure rate ({statuses})")

Quarantine Tagging

When a test is identified as flaky, tag it and move it to a quarantine suite. The quarantine suite runs separately — its failures do not block merges. This keeps your main pipeline green and trustworthy while you fix the underlying issues.

The danger with quarantine is letting tests rot there indefinitely. Set a policy: any test in quarantine for more than two weeks gets either fixed or deleted. A flaky test that nobody fixes is worse than no test at all, because it consumes compute and attention.

⚠️

Quarantine is treatment, not cure

Quarantining is a containment strategy, not a solution. If your quarantine suite grows faster than your team fixes tests, you have a systemic problem — likely in test architecture or environment stability — that tagging alone will not solve.
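The two-week policy is easy to enforce mechanically. A minimal sketch, assuming you record when each test entered quarantine (the test names and dates below are invented; in practice you would pull the timestamps from your test-management tool or a tracked config file):

```python
# Hypothetical sketch: flag quarantined tests that have exceeded the
# two-week fix-or-delete policy. Names and dates are illustrative.
from datetime import date, timedelta

QUARANTINE_LIMIT = timedelta(weeks=2)

def stale_quarantined(quarantined, today):
    """Return tests that have sat in quarantine longer than the limit."""
    return [name for name, entered in quarantined.items()
            if today - entered > QUARANTINE_LIMIT]

quarantined = {
    "test_flaky_payment": date(2024, 5, 1),   # entered 21 days before `today`
    "test_checkout_badge": date(2024, 5, 20),  # entered 2 days before `today`
}
print(stale_quarantined(quarantined, today=date(2024, 5, 22)))
# → ['test_flaky_payment']
```

Run a check like this in a scheduled CI job and open a ticket (or fail the job) for each stale entry, so quarantined tests cannot quietly rot.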

Here is how to implement quarantine tagging in common frameworks:

// Playwright — use test.describe with a tag
test.describe('@quarantine', () => {
  test('flaky checkout test', async ({ page }) => {
    // ...
  });
});

// Run non-quarantine tests in CI
// npx playwright test --grep-invert @quarantine

// Run quarantine tests separately for monitoring
// npx playwright test --grep @quarantine

# Pytest — use markers
import pytest

@pytest.mark.quarantine
def test_flaky_payment():
    pass

# pytest -m "not quarantine"   (CI pipeline)
# pytest -m "quarantine"       (monitoring)

Historical Trend Analysis

Track test results over time. A test that failed 0 times last month but failed 12 times this month signals a recent regression — maybe a new feature introduced a race condition, or a CI environment changed. Tools like Playwright's built-in reporting, Allure, or your CI platform's test analytics can surface these trends.

Build a simple flake-rate dashboard by tracking three metrics per test:

  1. Total executions — How many times the test ran in the past 30 days
  2. Failure count — How many of those runs failed
  3. Flake rate — Failures divided by executions, expressed as a percentage

Sort by flake rate descending, and your top 10 list becomes your fix priority queue. Teams that implement this consistently report a 60-70% reduction in overall suite flakiness within one quarter.
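The three dashboard metrics can be computed from raw run records in a few lines. A minimal sketch in Python; the record format and test names are illustrative rather than tied to any particular tool:

```python
# Sketch: compute executions, failures, and flake rate per test from a flat
# list of (test_name, passed) records, then sort into a fix-priority queue.
from collections import Counter

def flake_priority(run_records):
    """run_records: list of (test_name, passed: bool) from the last 30 days."""
    executions = Counter(name for name, _ in run_records)
    failures = Counter(name for name, passed in run_records if not passed)
    return sorted(
        ((name, failures[name] / executions[name]) for name in executions),
        key=lambda item: item[1],
        reverse=True,
    )

records = [("checkout", False), ("checkout", True), ("login", True),
           ("login", True), ("search", False), ("search", False)]
print(flake_priority(records))
# → [('search', 1.0), ('checkout', 0.5), ('login', 0.0)]
```

The head of the sorted list is your top-offenders queue; slice the first ten entries and work down from there.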

Git Bisect for Flakiness

When a previously stable test becomes flaky, git bisect can pinpoint the commit that introduced the instability. Combine it with repeated runs to account for the intermittent nature:

git bisect start HEAD v4.1.0  # flaky now, stable at v4.1.0

# Test script: run the flaky test 10 times, fail if any run fails
git bisect run bash -c '
  for i in {1..10}; do
    npx jest path/to/flaky.test.js || exit 1
  done
'

This is time-consuming but can identify the exact code change that triggered flakiness — especially valuable when the cause is a subtle application behavior change rather than a test issue.

Fixing Flaky Tests: A Practical Playbook

Once you have identified a flaky test, follow this diagnostic process:

  1. Reproduce the failure. Run the test 20 times in a loop. If it does not fail, try running it in the CI environment specifically — the difference between local and CI is often the key.

  2. Check the failure message. Timeout errors point to timing issues. "Element not found" points to locator or rendering problems. Assertion failures point to data or state issues.

  3. Isolate the test. Run it alone, then run it after the test that precedes it in the suite. If it only fails when run after a specific test, you have a shared-state problem.

  4. Add logging. Instrument the test with timestamps, screenshots at key steps, and console log capture. The goal is to see exactly what the application looked like at the moment of failure.

  5. Fix the root cause, not the symptom. Adding a retry or increasing a timeout might make the test pass, but if the underlying issue is shared state or a missing wait, the fix is temporary.
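Step 1's repro loop can be sketched as a tiny harness. In a real suite the callable would shell out to your runner (for example, `subprocess.run(["pytest", "path/to/test.py"])`); the deterministic "flaky" stand-in below exists only to make the example self-contained:

```python
# Sketch: run a test callable N times and count failures. The stand-in test
# fails on every third call, standing in for a real intermittent failure.

def reproduce(test_fn, runs=20):
    """Execute test_fn `runs` times; return (failures, runs)."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures, runs

calls = {"n": 0}

def sometimes_fails():
    calls["n"] += 1
    assert calls["n"] % 3 != 0  # fails on calls 3, 6, 9, 12, 15, 18

print(reproduce(sometimes_fails))  # → (6, 20)
```

If 20 local runs never fail, repeat the loop inside the CI environment; a failure rate that appears only there points at an environment difference rather than the test logic.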

Here is a concrete before-and-after example. This E2E test for a shopping cart intermittently failed because the cart count badge animation had not completed:

// BEFORE: Flaky — does not wait for cart update animation
test('add item to cart updates badge count', async ({ page }) => {
  await page.goto('/products');
  await page.click('[data-testid="add-to-cart-btn"]');
  const count = await page.textContent('[data-testid="cart-badge"]');
  expect(count).toBe('1'); // Fails ~15% of the time
});

// AFTER: Stable — waits for the specific text to appear
test('add item to cart updates badge count', async ({ page }) => {
  await page.goto('/products');
  await page.click('[data-testid="add-to-cart-btn"]');
  await expect(page.locator('[data-testid="cart-badge"]')).toHaveText('1', {
    timeout: 5000,
  });
});
💡

The 80/20 rule of flakiness

In most suites, 80% of flaky failures come from 20% of the tests. Fix your top 5 worst offenders first. You will see a disproportionate improvement in pipeline reliability.

Dealing with Flakiness in Parallel Test Execution

Parallel test execution dramatically reduces suite runtime but amplifies shared-state issues. When tests run in parallel across multiple workers, they compete for shared resources: databases, file systems, ports, and browser instances.

Strategies for parallel-safe tests:

  • Database isolation — Each worker gets its own schema or database. Tools like jest-postgres and testcontainers provision isolated databases per worker.
  • Port allocation — Use dynamic port assignment instead of hardcoded ports. Let the OS assign available ports with port: 0.
  • File system isolation — Each worker writes to its own temp directory: os.tmpdir() + worker ID.
  • Browser context isolation — In Playwright, each test gets a fresh browser context by default. Never share page objects between tests.
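The dynamic-port idea from the list above is simple to demonstrate: binding to port 0 asks the OS to pick any free port. A minimal Python sketch:

```python
# Sketch: OS-assigned ports for parallel workers. Binding to port 0 lets the
# kernel choose an available port, so workers never collide on a hardcoded one.
import socket

def free_port():
    """Ask the OS for an available ephemeral port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = free_port()
print(port)  # an OS-assigned ephemeral port number
```

One caveat: the socket is closed before your server binds the returned port, so a tiny reuse race remains. Where possible, have the application itself bind port 0 and report the port it received, rather than probing first.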

Preventing Flaky Tests Before They Merge

Prevention is cheaper than cure. Build these practices into your workflow:

Use deterministic waits everywhere. Ban Thread.sleep(), time.sleep(), and fixed-delay waits in code reviews. Every wait should poll for a specific condition.

Generate unique test data. Every test that creates data should use unique identifiers — UUIDs, timestamps, or sequential counters. Never hardcode shared usernames, emails, or IDs.

Mock external dependencies. Any test that calls a service your team does not own should use a mock or stub. Your test suite should be able to run without internet access.

Set consistent environment defaults. Lock viewport sizes, timezone, locale, and browser versions in your configuration. CI and local should behave identically.

Run new tests 10x before merging. Add a CI check that runs new or modified tests multiple times. If a test fails even once in 10 runs, it does not merge. Here is a GitHub Actions step that implements this:

- name: Verify new tests are stable
  run: |
    CHANGED_TESTS=$(git diff --name-only origin/main | grep '\.test\.' || true)
    if [ -n "$CHANGED_TESTS" ]; then
      for i in {1..10}; do
        echo "=== Stability run $i/10 ==="
        npx playwright test $CHANGED_TESTS
      done
    fi

Track flakiness metrics. Measure and report: total flaky test count, flakiness rate per test, top offenders, quarantine queue size, and mean time to fix. What gets measured gets managed.

Establish a flake budget. Set a team-level threshold — for example, overall suite flake rate must stay below 2%. When the rate exceeds the budget, prioritize flake fixes over new feature work. This creates organizational accountability for test quality.
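A flake budget only works if something enforces it. A minimal sketch of a CI gate using the 2% threshold from the example above; the input numbers are invented, and in practice they would come from your results store:

```python
# Sketch: fail the pipeline step when the suite-wide flake rate exceeds
# the agreed budget. Inputs are illustrative.

FLAKE_BUDGET = 0.02  # 2% suite-wide flake rate

def within_budget(flaky_failures, total_executions, budget=FLAKE_BUDGET):
    """Return (ok, rate) for the current measurement window."""
    rate = flaky_failures / total_executions
    return rate <= budget, rate

ok, rate = within_budget(flaky_failures=45, total_executions=3000)
print(ok, f"{rate:.2%}")  # → True 1.50%
```

Wire a check like this into a scheduled job: when it returns `False`, the team's agreement is that flake fixes jump the queue ahead of feature work.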

Measuring Flakiness: The Metrics That Matter

To manage flakiness effectively, track these metrics at the team level:

  • Flake rate — flaky failures divided by total executions, per test and suite-wide
  • Top offenders — the five to ten tests with the highest individual flake rates
  • Quarantine queue size — how many tests are quarantined, and how long each has been there
  • Mean time to fix — how long a flagged flaky test waits before it is repaired

Plot these metrics on a dashboard visible to the whole team. When the flake rate trends upward, it becomes a shared problem rather than something only the QA team notices. Teams that make flakiness visible typically reduce their flake rate by 50-70% within one quarter simply because the problem can no longer be ignored.

Common Mistakes When Dealing with Flaky Tests

Re-running and ignoring. The most common and most damaging response. Every re-run that passes without investigation is a missed opportunity to fix a real problem. A study by CircleCI found that teams spend an average of 4.2 hours per developer per month waiting for re-runs of flaky tests.

Adding blanket retries. Configuring your test runner to retry every failed test 3 times masks flakiness instead of fixing it. Use retries surgically — as a temporary measure while you investigate, not as a permanent policy. If you must use retries, log every retry so you can measure the true flake rate.

Blaming the tool. "Selenium is flaky" is almost never true. The tool does exactly what you tell it. If your tests are flaky, the problem is in how the tests are written, how the environment is configured, or how the application behaves under test conditions.

Deleting tests instead of fixing them. Tempting, but dangerous. That test existed for a reason — usually to catch a specific bug. Delete it, and you lose that coverage. Fix it instead. If the test is truly unfixable (testing inherently non-deterministic behavior), rewrite it with a different approach rather than removing it.

Not allocating dedicated time for flake fixes. Flaky test fixes consistently lose priority to feature work. Dedicate a fixed percentage of each sprint — even 10% — to test infrastructure improvements. Some teams designate a rotating "flake duty" role where one engineer spends a sprint fixing the top offenders.

How TestKase Helps You Track and Manage Test Quality

Flaky tests are fundamentally a visibility problem. You need to know which tests are unreliable, how often they fail, and whether the trend is improving or worsening.

TestKase gives you a structured test repository where every test case — manual or automated — has a clear status, history, and ownership. When you identify flaky automated tests, you can flag them in TestKase, assign them for investigation, and track their fix status alongside your other QA work.

By connecting your CI pipeline results to TestKase, you get a unified dashboard that shows not just pass/fail, but trends over time. You can spot emerging flakiness early, before it erodes your team's confidence in the test suite. TestKase's test cycle feature lets you create dedicated cycles for flake investigation, track which tests are in quarantine, and measure your team's progress toward a flake-free suite.


Conclusion

Flaky tests are not an inevitable part of automation — they are a solvable engineering problem. The root causes are well-understood: timing issues, shared state, environment differences, and network dependencies. Detection strategies like re-run analysis, quarantine tagging, and historical trend tracking surface the worst offenders. Prevention practices like deterministic waits, unique test data, mocked dependencies, and stability checks on new tests keep new flakiness out of your suite.

The teams with the most reliable test suites are not the ones with the best tools — they are the ones that treat flakiness as a first-class engineering concern, measure it, and fix it relentlessly. Start by measuring your current flake rate, fix your top five offenders, and build prevention into your merge process. Within a quarter, you will transform your CI pipeline from something developers distrust into something they rely on.
