End-to-End Testing Is Flaky by Default. Here's the Fix for Dialogs, Bots, and Auth

📌 Key Takeaways
  1. E2E tests flake on three edges Playwright docs don't cover: JS dialogs, bot detection, and auth walls.
  2. Global dialog handling (v1.1.0) makes per-test `page.on('dialog')` listeners obsolete.
  3. Stealth extraction replaces the `puppeteer-extra` + fingerprint-override stack with a single CLI command.
  4. Policy-based Human Assist converts `test.skip()` paths back into real coverage — no fake sessions.
  5. Browser-act complements Playwright/Cypress — migration is additive, not a rewrite.


Why Traditional E2E Tests Break (Three Real Scenarios)

Most e2e failures don't come from the happy path. They come from three specific edges.

Scenario 1 — The JavaScript Dialog That Kills the Selector

You write a Playwright test for a "delete item" flow. It passes locally. It passes in CI. Three weeks later, the product team adds a `window.confirm("Are you sure?")` before the delete fires. Now every test on that path burns its full 30-second timeout, because Playwright auto-dismisses unhandled dialogs: the confirm returns false, the delete never fires, and the assertion that the row is gone waits until it times out.

The Playwright fix is to register a page.on('dialog') listener — but the listener has to be registered before the click that triggers the dialog, and it has to be re-registered for every page instance. Miss one and the suite goes red on that path forever.

Scenario 2 — Staging Passes, Production Fails

Your CI runs headless Chromium against staging. All green. You ship. Production sits behind Cloudflare, and the first real-user flow trips the bot-detection layer. Your synthetics dashboard lights up, but your e2e suite never caught it because staging didn't have the same protection tier.

The root cause is a fingerprint mismatch: headless Chrome leaks navigator.webdriver, a different WebGL vendor string, and a telltale HeadlessChrome user agent. The usual patch is puppeteer-extra-plugin-stealth, plus a pile of manual fingerprint overrides, plus a proxy rotation, plus keeping all of that in sync with Chrome updates. Each one is a minor project on its own.
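Those leaks are cheap to check for, which is why naive headless runs get caught so quickly. A dependency-free sketch of the kind of heuristic a detection script applies (`looksAutomated` is illustrative only, not any vendor's actual logic; real products combine dozens of signals):

```javascript
// Illustrative only: the shape of a bot-detection check against a
// browser fingerprint. Real vendors score many more signals than this.
function looksAutomated(fp) {
  const headlessUA = /HeadlessChrome/.test(fp.userAgent || '');
  const webdriverFlag = fp.webdriver === true; // the navigator.webdriver leak
  return headlessUA || webdriverFlag;
}

// Default headless Chromium trips both signals:
console.log(looksAutomated({
  webdriver: true,
  userAgent: 'Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/120.0.0.0 Safari/537.36',
})); // true

// A real user's browser trips neither:
console.log(looksAutomated({
  webdriver: false,
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/120.0.0.0 Safari/537.36',
})); // false
```

Patching each signal individually is exactly the maintenance treadmill described above: every Chrome release can add a new one.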

Scenario 3 — The Login Step You Can't Automate

Your production login requires a captcha on new devices, an OTP from an authenticator app, and sometimes a manual device-trust prompt. None of these can be scripted without either (a) hard-coding bypass endpoints that don't exist in production or (b) injecting a fake session token that doesn't match the real auth path.

Most teams choose option (b) and accept that the login flow itself is untested. That's a quiet decision to let a load-bearing piece of the product exist outside the test suite.


Fix #1 — Auto-Handle JS Dialogs Without Manual Listeners

The Playwright approach looks like this:

```javascript
// Playwright: you have to remember this block on every page
page.on('dialog', async dialog => {
  await dialog.accept();
});

await page.goto('https://app.example.com/items/42');
await page.click('button.delete');
// Without the listener above, the confirm is auto-dismissed,
// the delete never fires, and the next assertion times out.
```

The problem isn't the API — it's that you have to remember. Across 40 test files, someone forgets. The result is a single flaky test that nobody can reproduce on their machine.
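If part of your suite stays on raw Playwright, one mitigation is to route every page through a single helper so the listener can't be forgotten. A sketch under that assumption (`installDialogAutoAccept` is a hypothetical helper, not a Playwright API):

```javascript
// Hypothetical helper: make the dialog handler a property of page creation,
// not of each test author's memory. Idempotent, so calling it twice is safe.
function installDialogAutoAccept(page) {
  if (page.__dialogAutoAccept) return page;   // already installed: no-op
  page.on('dialog', async dialog => {
    await dialog.accept();                    // confirm() resolves true, flow continues
  });
  page.__dialogAutoAccept = true;
  return page;
}
```

Call it from the one place pages are created (a fixture or a `beforeEach` wrapper) and the 40-file problem collapses into one file, which is the same move browser-act's global default makes for you.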

Browser-act's v1.1.0 release introduced a global dialog-handling behavior. By default, all alert, confirm, and prompt dialogs are auto-dismissed. If you need a test that actually exercises the dialog branch, you flip one flag:

```bash
# Default: all dialogs auto-handled, tests never hang
browser-act stealth-extract https://app.example.com/items/42

# Only when you explicitly want to assert against a dialog:
browser-act stealth-extract https://app.example.com/items/42 --no-auto-dialog
```

The default is right 95% of the time because 95% of the time you don't care about the dialog — you care about what's on the page after it's dismissed. One default, zero listeners, zero "did I remember to register it" bugs.

And when you do need to assert against the page state after a dialog is dismissed — not against dialog text itself — use browser-act evaluate to read directly from the DOM or a global variable, skipping brittle selector chains entirely:

```bash
browser-act evaluate "document.querySelector('.item-row').dataset.status"
# returns "deleted" once the confirm is accepted and the row is gone
```

💡 What's new in v1.1.0: Global dialog handling turns the "someone shipped a new confirm() prompt and the entire cart suite went red" pattern into a non-event. The dispatch logic moves from every test file into the runner. Your beforeEach block loses the dialog handler, your test authors stop copy-pasting the same five lines, and the opt-out flag --no-auto-dialog is there only when you're deliberately testing dialog text.


Fix #2 — Run E2E Against Stealth Browsers, Not Naked Headless

Here's the setup for a traditional stealth-capable Playwright suite:

```javascript
const { chromium } = require('playwright-extra');
const stealth = require('puppeteer-extra-plugin-stealth')();
chromium.use(stealth);

const browser = await chromium.launch({
  args: [
    '--disable-blink-features=AutomationControlled',
    '--no-sandbox',
    '--disable-web-security',
  ],
});
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ...',
  viewport: { width: 1920, height: 1080 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
  // ... eight more overrides
});
```

That's before you add proxy rotation, before you solve the inevitable WebRTC leak, and before Chrome 135 ships and breaks half of the fingerprint patches.

The equivalent in browser-act:

```bash
browser-act stealth-extract https://app.example.com \
  --content-type html \
  --proxy http://user:pass@proxy.example.com:8080 \
  --output result.html
```

One command. Anti-detection is handled by the CLI, not by a stack of plugins you maintain. If you're running e2e against a production URL that sits behind Cloudflare, DataDome, or PerimeterX, this is the difference between a test that runs and a test that 403s.

The wider point: your e2e suite should run against the same protection layer real users hit. If staging doesn't have the bot-detection tier, bake stealth into the test runner itself so it doesn't matter.

The --content-type flag controls the return format — html for DOM assertions, text for readability checks, json when you're extracting a specific structure. Combined with --timeout and --output, a single stealth-extract call replaces the five-file test scaffold that a traditional Playwright suite needs just to reach the page.

💡 What's new in v1.1.0: browser-act stealth-extract collapses fingerprint normalization, proxy support, and WebRTC leak mitigation into one command. Every environment — local dev, staging CI, production synthetics — exercises the same anti-detection surface, so "green in staging, 403 in prod" stops happening because nobody forgot to sync a stealth plugin. The playwright-extra chain, the manual User-Agent overrides, and the Chrome-update babysitting all drop out of your test repo.


Fix #3 — Human-in-the-Loop for Auth, Payment, and Unsolvable Captcha

This is the hardest one, because the honest answer is that some test paths can't be fully automated and you have to decide what to do when the agent gets stuck.

Browser-act's v1.1.0 ships four policy triggers that a test can register up front:

| Policy | When It Fires | What Happens |
| --- | --- | --- |
| `credential-login` | The page requests a username/password and none is configured | Test pauses, emits a Human Assist URL, waits for manual sign-in |
| `captcha-unsolvable` | The solver fails after N attempts | Test pauses, emits the URL, waits for a human to solve once |
| `payment-confirmation` | The URL pattern matches a known payment gate | Test pauses automatically — no accidental real charges in CI |
| `operation-stuck` | The agent has retried N times without progress | Test stops, outputs the last-known state, exits non-zero |

The pattern is the same across all four: instead of retrying in a loop until the CI runner kills the job with no diagnostic, the test stops, hands off, and resumes once the human signal comes back. You configure it in a policies.yaml per suite:

```yaml
policies:
  credential-login:
    action: human-assist
    notify: slack://#qa-handoff
  captcha-unsolvable:
    action: human-assist
    notify: slack://#qa-handoff
    max_solver_attempts: 2
  payment-confirmation:
    action: pause
    url_patterns:
      - "/checkout/confirm"
      - "/payment/"
  operation-stuck:
    action: stop
    max_retries: 3
    report: ./artifacts/stuck-state.json
```
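The `operation-stuck` semantics are worth spelling out, because they differ from a plain retry loop: attempts are bounded, and exhaustion produces a diagnostic rather than a hang. A plain-JS sketch of that shape (`withStuckPolicy` is illustrative, not browser-act's implementation):

```javascript
// Illustrative shape of a bounded-retry policy: try N times, then stop
// with a state report instead of looping until the CI runner kills the job.
async function withStuckPolicy(step, { maxRetries = 3 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return { ok: true, value: await step() };
    } catch (err) {
      lastError = err; // record the failure, retry up to the cap
    }
  }
  // Exhausted: surface the last-known state rather than timing out silently.
  return { ok: false, stuck: true, attempts: maxRetries, lastError: String(lastError) };
}
```

The payoff is the failure artifact: a stuck run exits with a report you can read, instead of a generic CI timeout.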

In CI, a stuck test posts a Human Assist URL into a Slack channel. A human clicks, signs in once, and the test picks up from the post-login state. The post-auth state is preserved with browser-act session save and rehydrated with --session on every subsequent test run until it expires — one human sign-in covers the next hundred test executions.

This is the piece that Playwright and Cypress structurally can't do. Not because the frameworks are bad — they're not — but because their execution model assumes a fully deterministic agent. Real production flows aren't fully deterministic, so the test has to be able to ask for help without dying.

💡 What's new in v1.1.0: Policy-based Human Assist replaces the test.skip('can't automate in CI') graveyard. The four triggers cover the paths teams consistently abandon — MFA-gated login, unsolvable captcha, payment confirmation, and agent stuck loops — each with a Human Assist URL that routes to Slack, email, or any webhook. A stuck path stops being a test gap; it becomes a 30-second human interrupt, after which the session replays across the rest of the suite without asking again.



Putting It Together — A Complete E2E Flow

Here's a shopping-cart e2e that exercises all three fixes. The scenario: sign in, add two items, go to checkout, verify the total, stop before payment so CI doesn't charge the real card.

```javascript
// e2e/cart-checkout.js
const assert = require('assert');
const { runBrowserAct } = require('browser-act');

async function test() {
  const session = await runBrowserAct({
    url: 'https://shop.example.com',
    stealth: true,
    policies: './policies.yaml',
  });

  // Step 1: login (policy handles OTP if needed)
  await session.goto('/login');
  await session.fill('input[name=email]', process.env.TEST_EMAIL);
  await session.fill('input[name=password]', process.env.TEST_PASSWORD);
  await session.click('button[type=submit]');
  // If MFA fires, the credential-login policy pauses here and
  // posts a Human Assist URL to Slack. Once the human signs in,
  // the session resumes automatically.

  // Step 2: add items (any confirm() dialogs auto-dismissed)
  await session.goto('/products/widget-a');
  await session.click('button.add-to-cart');
  await session.goto('/products/widget-b');
  await session.click('button.add-to-cart');

  // Step 3: checkout
  await session.goto('/cart');
  const total = await session.text('span.cart-total');
  assert.equal(total, '$47.98');

  // Step 4: proceed to payment — policy stops the test here
  await session.click('button.checkout');
  // The payment-confirmation policy matches /checkout/confirm.
  // The test exits cleanly with state captured.
  // No real card is charged in CI.
}
```

The interesting thing is not the length — it's about the same line count as a Playwright test. The difference is that every flaky point in a traditional e2e is now handled by a default, a CLI flag, or a policy, not by a piece of glue code you have to maintain.

Compare the artifacts:

| Artifact | Playwright Version | Browser-act Version |
| --- | --- | --- |
| Dialog handling | `page.on('dialog')` listener in every test file | Global default, zero code |
| Stealth fingerprint | `playwright-extra` + stealth plugin + custom overrides | `stealth: true` flag |
| OTP / captcha | Skipped or mocked | `credential-login` / `captcha-unsolvable` policy + Human Assist URL |
| Payment safety | Commented-out click + manual verification | `payment-confirmation` policy |
| Stuck detection | CI timeout, no diagnostic | `operation-stuck` policy + state dump |

Every row on the right is in policies.yaml. The test file just expresses intent.


Migration Path from Playwright or Cypress

You don't have to replace your whole suite. The smart migration path is additive:

Week 1 — Replace the dialog hacks. Find every page.on('dialog') in your repo. Replace the ones that just accept/dismiss with a browser-act stealth-extract call in the same step. Keep the ones that actually assert against dialog text as Playwright steps. You'll delete roughly 40% of your dialog boilerplate.

Week 2 — Wrap the bot-wall tests. Identify the 5-10 tests that flake against staging because of bot detection. Change those specific tests to fetch the page via stealth-extract first, then hand the HTML to Cheerio or JSDOM for assertions. Keep everything else as Playwright. For visual regression on those same pages, browser-act screenshot --full-page --output baseline.png captures a deterministic screenshot you can diff; --har on any extract call writes a full HAR file you can replay to verify API contract expectations.
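Once `stealth-extract` has written the HTML to disk, the assertion layer can be as light as you want: Cheerio, JSDOM, or even a plain string check for smoke tests. A dependency-free sketch (`extractCartTotal` is illustrative; prefer a real parser for anything beyond a smoke check):

```javascript
// Illustrative: assert against extracted HTML without spinning up a browser.
// A regex is fine for a one-field smoke check; use Cheerio/JSDOM otherwise.
function extractCartTotal(html) {
  const match = html.match(/class="cart-total"[^>]*>([^<]+)</);
  return match ? match[1].trim() : null;
}

const html = '<span class="cart-total"> $47.98 </span>';
console.log(extractCartTotal(html)); // "$47.98"
```

The point of the split is speed: the browser work happens once in the extract step, and the assertions run as ordinary fast unit tests.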

Week 3 — Add policies for your known unautomatable paths. Login with OTP. Payment confirmation. Anywhere your suite currently has a test.skip() because "we can't automate this in CI." Add the matching policy, wire Human Assist to Slack, and those skipped paths come back into coverage.

At no point do you rewrite the whole suite. You're filling gaps, not replacing the framework.

💡 Pro Tip

The clearest signal that this migration is working: the number of test.skip() calls in your repo goes down. Those are the paths you gave up on. Each one you bring back is a piece of the product that was quietly untested.


When NOT to Use Browser-Act for E2E

Browser-act is a browser-level tool. It's the wrong choice for:

  • Pure unit tests. If the code doesn't touch a browser, Jest or Vitest is faster and simpler. Don't spin up a Chromium instance to test a pure function.
  • API contract tests. If you're testing the response shape of POST /api/orders, use Supertest or Postman collections. Browser-act would just add a layer of indirection.
  • High-concurrency load tests. For 10,000 virtual users hammering an endpoint, k6 or Locust is the right tool. Browser agents aren't designed to simulate load.
  • Tests of non-browser flows. Mobile app e2e, CLI tool e2e, server-to-server integration — all out of scope.

The rule of thumb: use browser-act where a real browser + real user flow is the thing you're testing. Everything else, use the tool that was built for that layer.


Conclusion

Most "what is end-to-end testing" articles spend 2,000 words on the testing pyramid and then stop before saying anything about why the pyramid's top tier is flaky in practice. The flakiness isn't a mystery — it's three specific failure modes:

1. A dialog that blocks the selector chain. Fixed by a sane default.
2. A bot wall that makes staging lie to you. Fixed by stealth at the runner level.
3. An auth or payment step that needs a human. Fixed by a policy that pauses and hands off.

None of these require throwing out your existing suite. They require admitting that e2e tests have to deal with the real shape of production — modals, bot checks, OTP screens — and building the tooling around that shape instead of around the happy path.

If you're already running Playwright or Cypress in CI and the flakiness tax is eating your team's time, start with the one policy that hurts the most. Most teams start with credential-login because it unlocks the largest chunk of "stuff we stopped testing because we gave up."


Get Started

  • Install: npm install -g browser-act (or brew install browser-act)
  • Docs for v1.1.0 policies: browser-act/skills/browser-act
  • The one command for the skeptic: `browser-act stealth-extract https://your-staging.example.com --output test.html` — run this against any staging URL that currently flakes in CI, and see whether the output matches what Playwright returns. If they differ, the bot wall was lying to your test suite.

The test you stopped writing is the one that would have caught the bug.





Frequently Asked Questions

What's the difference between end-to-end testing and UAT?

End-to-end testing is automated verification that a full user flow works through real browser interaction, run by engineers as part of CI. UAT (User Acceptance Testing) is manual validation by actual users or stakeholders confirming the product meets business requirements before release. E2E catches regressions; UAT catches whether you built the right thing. They are not substitutes — a mature release pipeline runs both, and e2e failures should block a release long before UAT even begins.

What's an example of an end-to-end test?

A typical e2e test covers a login-to-checkout flow: open the homepage, click sign-in, enter credentials, verify the dashboard loads, navigate to a product, add it to the cart, go to checkout, verify the total matches the expected value, and stop before actual payment. Every step uses the real browser against a real (staging or production) backend, with no mocks in between. The "Putting It Together" section above walks through exactly this test written against browser-act.

Is end-to-end testing worth the effort for large projects?

It's worth it only if the flakiness tax is lower than the cost of the bugs it catches. The reason e2e often fails this math is that teams spend 60% of their e2e maintenance time on three specific failure modes — dialogs, bot detection, and auth walls — that add no product-level value. Cutting those three maintenance costs (which is what policy-based handoff and stealth extraction do) changes the math significantly. Measure it: count your e2e suite's flake rate and the hours your team spends triaging flakes per week. If the ratio is worse than 1:1 coverage-to-maintenance, fix the three failure modes before deciding e2e isn't worth it.

What's the difference between end-to-end testing and integration testing?

Integration testing verifies that two or more components talk to each other correctly, usually through code-level interfaces and often with mocks at the outer boundary. End-to-end testing verifies the full stack — frontend, backend, database, third-party dependencies — behaves correctly for a user flow, with no mocks. Integration tests are faster and cheaper; e2e tests are slower but cover the interaction surface your users actually hit. Most production incidents sit in e2e's domain, not integration's.

Can I use browser-act alongside Playwright or Cypress?

Yes, and for most teams that's the right starting point. The recommended pattern: keep Playwright or Cypress for the assertion-heavy tests where DOM introspection and selector chains are the right abstraction, and use browser-act for the parts of the flow where those frameworks struggle — stealth-extraction through bot walls, global dialog handling, and policy-based handoff for auth and payment paths. The migration section above walks through a three-week additive rollout.

How is policy-based human assist different from test retries?

Retries assume the test will eventually succeed if you run it enough times. That's true for race conditions, but it's false for MFA prompts, unsolvable captchas, and payment gates — no amount of retrying produces a one-time password. Policy-based handoff recognizes that some steps structurally require a human, pauses the test at exactly that point, emits a URL a human can act on, and resumes once the human completes the step. It converts a `test.skip()` into coverage, which retries cannot.
