Currents Team

Jun 19, 2026

Playwright + Feature Flags: Advanced Test Isolation Strategies

Feature flags break Playwright's isolation model because they live outside the browser context. Learn how to intercept flag evaluation at the network layer, declare flag state as a fixture, and structure parallel runs so no worker can affect another's flag context.

TL;DR: Intercept flag SDK traffic with page.route() so each test controls its own flag payload, independent of the live provider. Declare flag state as a Playwright fixture with { option: true }, not as conditional logic or runtime SDK reads. Scope test identities per worker via parallelIndex so parallel runs never share flag state.

Feature flags and Playwright's isolation model solve different problems at different layers. That mismatch is the source of some of the most confusing flakiness in enterprise CI pipelines. Playwright gives each test a clean BrowserContext. That is, isolated cookies, localStorage, and session storage, effectively an incognito profile per test. The problem is that feature flags are typically not evaluated inside the browser context. They live in backend services, SDK polling loops, or streaming connections that operate outside the browser context Playwright controls.

The consequence of this shows up in parallel runs. Say, for example, Worker 1 runs a beforeAll hook that flips a flag via a management API call to enable a new checkout flow for a shared test identity. Worker 3 is mid-test, using that same identity, and its SDK client receives a patch event over its SSE stream. The UI changes underneath the assertion. The test fails with an "element not found" error that has nothing to do with the code change being tested.

Run the test again and it passes, because this time Worker 1's mutation landed after Worker 3 finished. This is a race condition caused by hidden shared state, not a genuine product bug. It is the kind of flaky test that erodes CI trust without pointing to a real regression. Most of the races this article addresses are shared mutable state races: two workers read and write the same flag key against a live backend. This specific scenario is a narrower variant, a time-of-check to time-of-use (TOCTOU) race, where the test checked UI state at one moment and acted on it at another while a different worker mutated shared infrastructure in between. The fix throughout is isolation, not locking. Without cross-run visibility, this kind of failure looks like infrastructure noise rather than a systematic problem.

This article is about control: how to intercept flag evaluation at the network layer, how to declare flag state as a fixture rather than reading it reactively, and how to structure parallel runs so that no worker can affect another's flag context. But, to understand why those mechanisms are necessary, it helps to be precise about where Playwright's isolation actually ends and where feature flags begin.

Why Feature Flags Break Playwright's Isolation Model

Each BrowserContext is a clean slate for browser-managed state: cookies, localStorage, sessionStorage, IndexedDB, and the HTTP cache. What it does not cover is anything evaluated server-side or via an SDK that communicates with an external backend.

Feature flag evaluation paths fall into four categories:

Server-side evaluation happens when the backend resolves flag values based on a user identity before rendering the page or responding to an API call. The client never sees the flag system at all.
Edge evaluation happens at the CDN or edge function layer. Vercel Edge Config, Cloudflare Workers, and LaunchDarkly's own edge SDKs evaluate flags before the request reaches the application origin. From Playwright's perspective, this looks like server-side evaluation, but it is not controllable via page.route() because the evaluation point is outside the application. Tests against edge-evaluated flags require one of three approaches: a pre-edge test endpoint, header-based overrides (when the edge function supports them), or a test environment that bypasses the edge layer entirely.
Client-side evaluation happens when the browser loads an SDK that opens a connection to the flag provider, fetches a payload keyed to the current user context, and evaluates flags in memory.
Cookie or header overrides are a lighter pattern where the application reads specific request headers like x-feature-override: new-checkout=true and uses them to bypass the SDK evaluation entirely.

Of the four paths, only the last is naturally isolated by Playwright's browser context model. The first three are not.

One pattern worth mentioning before moving on is applications that pre-evaluate flags server-side and inject the result into the initial HTML. React and Next.js apps commonly do this via window.__FLAGS__ = {...} in a script tag or via React Server Components. In this case the client-side SDK bootstraps from the injected values rather than fetching from the flag provider's network endpoint, and page.route() cannot intercept the flag values because they arrive as part of the HTML response. The mechanism for these applications is the SDK's bootstrap option, which the taxonomy section covers.

The Shared Environment Problem

In shared CI environments, flag state is effectively global unless it is explicitly scoped per user. When multiple parallel workers operate under the same test identity and hit the same flag evaluation endpoints, their flag states are entangled. A mutation from one worker propagates to all others sharing that identity, and the timing of that propagation is non-deterministic.

This class of failure has a recognizable signature: tests pass locally but fail in CI, failures appear only on specific workers or retry attempts, and results are inconsistent across branches even when the code is identical. Debugging Playwright timeouts that stem from flag state changes is particularly painful because the timeout is only a symptom. The actual cause, a UI that rendered a different component tree than the test expected, is invisible in the error message.

Conditional Test Logic as a Code Smell

The if (featureEnabled) { ... } else { test.skip() } pattern is the most common antipattern in flag-heavy Playwright suites, and it compounds the shared state problem rather than solving it. When a test reads flag state reactively, it couples itself to an ambient condition it does not control. If that condition changes mid-run, the test's behavior changes with it. The result is a suite where a flag flip in the dashboard can silently change which tests run and what they assert, without any code change. The correct mental model is to treat flag state the same way Playwright treats authentication state. You do not just write if (user.isLoggedIn()) { ... }. You declare auth state as a fixture, provision it before the test runs, and assert against a known condition. Building reliable Playwright tests is fundamentally a question of what your tests declare versus what they discover at runtime. Flag state should always be declared.

This is the same pattern Playwright's own storageState provides for authentication: a JSON snapshot of cookies and localStorage, captured once and provisioned per test or per project. Flag state deserves the same treatment: a JSON snapshot, injected through a fixture, asserted against deterministically. If your team already uses storageState for auth, you already have the mental model.

Knowing what breaks is the first step. The next is understanding the range of approaches available, because not all of them carry equal risk in a parallel environment.

A Taxonomy of Flag Integration Patterns

Not all approaches to flag integration carry equal risk in a parallel test environment, and choosing the wrong one compounds the problem rather than containing it.

The live SDK approach connects the test environment directly to the real flag provider with no interception. Flag values reflect whatever the dashboard says at the moment the SDK initializes. This is appropriate for production smoke tests where the goal is to verify actual system state, but it is the wrong choice for any CI test that needs to be deterministic. The flag can change between when the test run starts and when the assertion fires.

API-level mutation means calling the flag provider's management API before each test to set flag values programmatically. This is more controlled than live SDK, but it introduces its own parallelism problem. Unless every worker operates under a strictly isolated user identity with its own flag scope, a mutation from one worker will affect others. It also requires serial execution or complex coordination to be safe, which negates the benefit of parallelism.

Network interception via page.route() is the approach that achieves genuine per-test isolation. By intercepting the SDK's HTTP traffic before the page navigates, the test controls exactly what flag payload the application receives. The flag provider's live state is irrelevant. This is the pattern the rest of this article is built around.

Cookie and header overrides are the lightest option when the application is architected to support them. If the app reads a request header like x-feature-override: new-checkout=true and uses it to bypass SDK evaluation, a Playwright test can set that header on every request without any network interception. This works well for features behind a single flag but gets unwieldy when tests need to compose multiple flag states.
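As a rough sketch of how multiple overrides might be composed, the helper below serializes a set of boolean flags into a single header value. The function name and the comma-separated key=value format are assumptions made for illustration; the application under test defines the real contract.

```typescript
// Hypothetical helper: serializes flag overrides into one header value.
// The comma-joined "key=value" format is an assumption; match whatever
// the application under test actually parses.
type Overrides = Record<string, boolean>;

function buildOverrideHeader(overrides: Overrides): string {
  return Object.entries(overrides)
    .map(([key, enabled]) => `${key}=${enabled}`)
    .join(',');
}

// Possible usage in a spec file (extraHTTPHeaders is a standard Playwright option):
//   test.use({
//     extraHTTPHeaders: {
//       'x-feature-override': buildOverrideHeader({ 'new-checkout': true }),
//     },
//   });
const headerValue = buildOverrideHeader({ 'new-checkout': true, 'express-pay': false });
```

Keeping the serialization in one place means a change to the header format touches a single function rather than every spec.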

Two additional patterns are worth naming for teams with the right application architecture. SDK bootstrap injection uses the SDK's own bootstrap option to pre-populate flag values in memory before the SDK makes any network request. For LaunchDarkly, LDOptions.bootstrap accepts a flag map at initialization. In Playwright, you inject the bootstrap payload via page.addInitScript() before the application code runs:

await page.addInitScript((flags) => {
  (window as any).__LD_BOOTSTRAP__ = flags;
}, ldFlagPayload);

The application reads window.__LD_BOOTSTRAP__ and passes it to LDClient.initialize(). This eliminates the network layer entirely and avoids the SSE connection lifecycle problem. It requires the application to be designed to accept bootstrap from a known global, which not every codebase does.
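One way to produce the injected payload is to flatten a declared flag map into plain key-to-value pairs. This is a minimal sketch, assuming the SDK's bootstrap option takes a flat map of flag key to value, which is worth confirming against the SDK version in use. The helper names are hypothetical.

```typescript
type FlagValue = { value: unknown; variation: number; version?: number };
type FlagMap = Record<string, FlagValue>;

// Hypothetical test-side helper: keeps only the resolved value per flag.
// Assumption: bootstrap accepts a flat { flagKey: value } object.
function toBootstrap(flags: FlagMap): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, flag] of Object.entries(flags)) {
    out[key] = flag.value;
  }
  return out;
}

// Hypothetical app-side reader: returns the injected flags, or undefined
// so the SDK falls back to its normal network fetch.
function readBootstrap(
  globalObj: { __LD_BOOTSTRAP__?: Record<string, unknown> }
): Record<string, unknown> | undefined {
  return globalObj.__LD_BOOTSTRAP__;
}

const bootstrap = toBootstrap({ 'new-checkout': { value: true, variation: 0 } });
```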

SDK base URL override points the SDK at a local mock server via LDOptions.baseUrl, streamUrl, and eventsUrl. This suits teams that already run mock services and want a single persistent backing system shared across tests, rather than per-test route handlers. The cost is operating the mock service.

The remainder of this article focuses on network interception because it requires no application-side changes and no additional infrastructure. For teams where bootstrap injection is available, it is often the cleaner answer. While network interception is the right mechanism, fixtures are the right abstraction for managing it. The two of them combined make flag state something a test declares rather than something it discovers.

Fixture-Level Flag Architecture

The right abstraction for flag state in Playwright is a fixture, not a beforeEach hook. Hooks are imperative and order-dependent. If two hooks interact with the same flag endpoint, their combined behavior is not readable from either hook in isolation. Fixtures make that dependency explicit, composable, and automatically tied to the test lifecycle. The goal is to make flag state something a test file declares at the top, not something it configures inside each test body.

The Flag Context Fixture

The core fixture intercepts the flag SDK's evaluation endpoint and returns a controlled payload before the page loads. The critical constraint is that page.route() must be registered before page.goto(). The LaunchDarkly JavaScript SDK fires its initial evaluation request during script initialization on the first navigation. If the route handler is not in place before that request fires, the SDK fetches from the real endpoint, caches the result, and the mock never takes effect. The cleanest way to enforce this ordering is to override the built-in page fixture rather than creating a separate fixture that depends on it. Overriding page means the route handlers are installed at the moment the browser context creates the page object, before any test code can call goto().

An important nuance: Playwright processes route handlers in reverse registration order, so the last registered handler takes priority. Overriding the page fixture guarantees timing (handlers are installed before goto()), but not priority over subsequently registered handlers. If a test or a composed fixture registers an additional page.route() for the same URL pattern after the flag fixture runs, that handler will take precedence over the flag mock. When composing multiple fixtures that each register route handlers, register lower-priority (more general) handlers first and higher-priority (more specific) handlers last.
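The ordering rule is easier to reason about with a toy model. The sketch below is not Playwright's implementation; it only mirrors the "last registered matching handler wins" rule described above, with hypothetical handler names.

```typescript
// Toy model of the matching rule only; not Playwright's implementation.
type Handler = { pattern: RegExp; name: string };

function pickHandler(registered: Handler[], url: string): string | undefined {
  // Walk from the most recently registered handler back to the first.
  for (const handler of [...registered].reverse()) {
    if (handler.pattern.test(url)) return handler.name;
  }
  return undefined;
}

const handlers: Handler[] = [
  // Registered first: general, lower priority.
  { pattern: /launchdarkly\.com/, name: 'general-block' },
  // Registered last: specific, higher priority.
  { pattern: /clientsdk\.launchdarkly\.com\/sdk\/evalx/, name: 'flag-mock' },
];

const evalWinner = pickHandler(
  handlers,
  'https://clientsdk.launchdarkly.com/sdk/evalx/env/contexts/abc'
);
const eventsWinner = pickHandler(handlers, 'https://events.launchdarkly.com/events/bulk/env');
```

Swapping the two entries would make the general handler win for the evaluation URL as well, which is the failure mode to watch for when composing fixtures.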

It is worth knowing that context.route() applies the same interception to the entire browser context, covering all pages, rather than only the one page the page-level override controls. For tests that open new pages such as popup auth flows or multi-tab navigation, context.route() ensures the flag intercept covers every page in the context. For most single-page test flows, the page-level override is sufficient. For multi-page flows, override the context fixture instead and substitute context.route() for page.route().

// fixtures/flags.ts

import { test as base, expect } from '@playwright/test';

type FlagValue = { value: unknown; variation: number; version?: number };
type FlagMap = Record<string, FlagValue>;

function toLDPayload(flags: FlagMap) {
  const out: Record<string, unknown> = {};
  for (const [key, v] of Object.entries(flags)) {
    out[key] = {
      value: v.value,
      variation: v.variation,
      version: v.version ?? 1,
      flagVersion: v.version ?? 1,
      trackEvents: false,
      reason: { kind: 'FALLTHROUGH' },
    };
  }
  return out;
}

export const test = base.extend<{ flags: FlagMap }>({
  // { option: true } makes this overridable via test.use() at describe or file level
  flags: [{}, { option: true }],

  page: async ({ page, flags }, use) => {
    const body = JSON.stringify(toLDPayload(flags));

    // Intercept the polling endpoint (contexts path, used by current SDKs)
    await page.route(/clientsdk\.launchdarkly\.com\/sdk\/evalx\/.*\/contexts/, route =>
      route.fulfill({ status: 200, contentType: 'application/json', body })
    );

    // Intercept the legacy users path (older SDKs before multi-context support)
    await page.route(/clientsdk\.launchdarkly\.com\/sdk\/evalx\/.*\/users/, route =>
      route.fulfill({ status: 200, contentType: 'application/json', body })
    );

    // Intercept the SSE streaming endpoint with a synthetic put event
    await page.route(/clientstream\.launchdarkly\.com\/eval/, route =>
      route.fulfill({
        status: 200,
        contentType: 'text/event-stream',
        headers: { 'Cache-Control': 'no-cache', 'Connection': 'keep-alive' },
        body: `event: put\ndata: ${body}\n\n:keep-alive\n\n`,
      })
    );

    // Silence analytics so events don't leak to the real provider
    await page.route(/events\.launchdarkly\.com/, route =>
      route.fulfill({ status: 202, body: '' })
    );

    await use(page);
  },
});

export { expect };

A nuance with this intercept: route.fulfill() returns a complete HTTP response and closes the connection immediately, so the body is delivered all at once, not streamed. The :keep-alive comment at the end of the SSE body is vestigial; the connection is already closed before it would matter. The SDK treats the closed connection as a stream interruption and reconnects, hitting the same route handler again.

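The reconnects do not affect flag evaluation, since each returns the same put event, but they generate visible network noise and console reconnect messages. For tests that assert on a clean console or network panel, one option is the times option on page.route(), which limits a handler to a single fulfillment. Once that handler is used up, later requests fall through to any earlier matching handler or, if there is none, to the real network. To keep reconnects away from the live provider, register an aborting handler for the same pattern first. The aborted reconnects never leave the browser, although the SDK may still log them:

// Registered first, so it only runs once the one-shot handler below is used up.
await page.route(/clientstream\.launchdarkly\.com\/eval/, route => route.abort());
// Registered last, so it takes priority for the first connection only.
await page.route(
  /clientstream\.launchdarkly\.com\/eval/,
  route => route.fulfill({ ... }),
  { times: 1 }
);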

The alternative is to disable streaming entirely by passing streaming: false to LDClient.initialize(), which removes the SSE connection altogether and is the cleanest option when the application does not depend on real-time flag updates during a test run. Either approach works; { times: 1 } is the lighter change, and streaming: false is more thorough.
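Since the response is delivered in one shot either way, the stream body can be reduced to a single put event. The helper below is a small sketch of that; the function name is hypothetical, and it deliberately leaves out the keep-alive comment that the earlier section describes as vestigial.

```typescript
// Builds the body for a one-shot synthetic stream response.
function ssePutEvent(payload: Record<string, unknown>): string {
  // A server-sent event is an "event:" line and a "data:" line
  // terminated by a blank line. JSON.stringify emits a single line,
  // so the data field never spans multiple lines.
  return `event: put\ndata: ${JSON.stringify(payload)}\n\n`;
}

const sseBody = ssePutEvent({
  'new-checkout': { value: true, variation: 0, version: 1 },
});
```

The result can be passed as the body of route.fulfill() with contentType set to text/event-stream.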

The toLDPayload helper is important. If your mock omits required fields like variation, version, or flagVersion, behavior depends on the SDK version. Newer LaunchDarkly SDKs treat the response as malformed and fall back to the caller-supplied default, making the feature appear off. Older SDKs are more permissive and may accept partial responses. In either case, the failure is silent rather than loud, which is exactly why building a typed helper matters more than memorizing the field list. The helper documents what your tests assume the SDK will accept.
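Because the SDK will not complain about an incomplete mock, it can help to make the test harness complain instead. The following is a minimal sketch of such a check, using the five fields the toLDPayload helper always emits. The function and constant names are hypothetical.

```typescript
// Fields every mocked flag entry is expected to carry.
const REQUIRED_FIELDS = ['value', 'variation', 'version', 'flagVersion', 'trackEvents'] as const;

// Returns a list of problems; an empty list means every entry is complete.
function findPayloadProblems(payload: Record<string, Record<string, unknown>>): string[] {
  const problems: string[] = [];
  for (const [flagKey, entry] of Object.entries(payload)) {
    for (const field of REQUIRED_FIELDS) {
      if (!(field in entry)) problems.push(`${flagKey}: missing "${field}"`);
    }
  }
  return problems;
}

// Throws loudly so a malformed mock fails the test run up front.
function assertCompletePayload(payload: Record<string, Record<string, unknown>>): void {
  const problems = findPayloadProblems(payload);
  if (problems.length > 0) {
    throw new Error(`Malformed flag mock:\n${problems.join('\n')}`);
  }
}
```

Calling assertCompletePayload() on the payload before handing it to route.fulfill() turns a silent SDK fallback into an explicit fixture error.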

For Flagsmith, the equivalent interception targets /api/v1/identities/ with a response array of { feature: { name, type, id }, feature_state_value, enabled } objects. For Split/Harness, you intercept /api/memberships/ and /api/splitChanges separately. Any treatment whose split is missing from your mock evaluates to "control", since that is what the SDK returns when it cannot find a split definition.

Worker-Scoped vs. Test-Scoped Fixtures

Test-scoped fixtures are torn down and rebuilt for every test. This gives maximum isolation but carries overhead proportional to the cost of setting up the fixture. For a page.route() intercept, that overhead is negligible, so test-scoped is the right default for flag fixtures.

Worker-scoped fixtures persist for the lifetime of a worker process and are shared across all tests that run on that worker. They are appropriate for expensive setup that all tests on a worker can safely share, like a logged-in browser storage state. They are dangerous for flag state when different tests on the same worker require different flag configurations, because changing the fixture state for one test changes it for all subsequent tests on that worker. Worker-scoped API mutations are where cross-worker flag collisions originate.

The composition pattern that works in practice is worker-scoped fixtures for things that do not change between tests on the same worker, such as login storage state or database identity, and test-scoped fixtures for flag intercepts, which do change between tests. The two coexist because route handlers are scoped to the page object, which is created fresh per test even when the worker persists.

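import { test as base } from '@playwright/test';

type FlagValue = { value: unknown; variation: number; version?: number };
type FlagMap = Record<string, FlagValue>;
type Account = { email: string };

export const test = base.extend<
  { flags: FlagMap },   // test-scoped
  { account: Account }  // worker-scoped
>({
  flags: [{}, { option: true }],
  // One identity per worker slot, created once and reused by every test on that worker
  account: [async ({}, use, workerInfo) => {
    await use({ email: `e2e+slot${workerInfo.parallelIndex}@example.com` });
  }, { scope: 'worker' }],
  page: async ({ page, flags }, use) => {
    // Register the flag intercepts for `flags` here, as in fixtures/flags.ts
    await use(page);
  },
});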

Each test runs with the worker's persistent identity but its own flag intercepts.

Composing Variants With test.use()

Once the fixture is an option fixture (declared with { option: true }), test.use() can override it at file level or inside a test.describe() block. This is how you express flag variants as first-class test structure without conditional logic.

// tests/checkout.spec.ts

import { test, expect } from '../fixtures/flags';

test.describe('checkout with legacy flow', () => {
  test.use({ flags: { 'new-checkout': { value: false, variation: 1 } } });

  test('renders classic form fields', async ({ page }) => {
    await page.goto('/checkout');
    await expect(page.getByTestId('classic-checkout')).toBeVisible();
  });
});

test.describe('checkout with new flow enabled', () => {
  test.use({
    flags: {
      'new-checkout': { value: true, variation: 0 },
      'express-pay':  { value: true, variation: 0 },
    },
  });

  test('renders express-pay button', async ({ page }) => {
    await page.goto('/checkout');
    await expect(page.getByTestId('express-pay')).toBeVisible();
  });
});

Each describe block declares exactly the flag state it needs. There is no conditional logic. There is no ambient state to read. If the flag configuration is wrong, the test fails because the expected element is not there, which is a genuine product signal, not a test infrastructure artifact.

Percentage Rollouts and Multi-Variate Flags

Enterprise flag systems go beyond boolean on/off toggles. Percentage rollouts, user-segment targeting, and multi-variate flags (string or JSON values with three or more variations) are common in production. The fixture architecture above handles all of these without changes because value is typed as unknown and variation is a numeric index, not a boolean.

The key insight is that you never mock the rollout percentage itself. A 50% rollout is non-deterministic by design: the flag provider evaluates targeting rules server-side and returns a resolved result for a specific user context. Your fixture declares that resolved result directly. "This user sees variation 0" or "this user sees variation 2" is exactly what the SDK would return after evaluating the rollout rules. The test never needs to know the rollout percentage exists.

test.describe('pricing page with control variant', () => {
  test.use({ flags: { 'pricing-experiment': { value: 'control', variation: 0 } } });

  test('renders original pricing table', async ({ page }) => {
    await page.goto('/pricing');
    await expect(page.getByTestId('pricing-original')).toBeVisible();
  });
});

test.describe('pricing page with variant-a', () => {
  test.use({ flags: { 'pricing-experiment': { value: 'variant-a', variation: 1 } } });

  test('renders redesigned pricing table', async ({ page }) => {
    await page.goto('/pricing');
    await expect(page.getByTestId('pricing-redesign')).toBeVisible();
  });
});

test.describe('pricing page with variant-b', () => {
  test.use({ flags: { 'pricing-experiment': { value: 'variant-b', variation: 2 } } });

  test('renders simplified pricing table', async ({ page }) => {
    await page.goto('/pricing');
    await expect(page.getByTestId('pricing-simplified')).toBeVisible();
  });
});

When a multi-variate flag has many variations, maintaining separate test.describe blocks per file gets repetitive. That is where the Projects pattern becomes especially useful. Instead of repeating describe blocks, you model each variation as a project in playwright.config.ts and run the same test file against all of them in parallel.
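A sketch of how per-variation projects might be generated is shown below. It assumes the variation index equals the value's position in the list, and the helper name and project naming scheme are hypothetical.

```typescript
type FlagValue = { value: unknown; variation: number; version?: number };
type FlagMap = Record<string, FlagValue>;
type VariantProject = { name: string; use: { flags: FlagMap } };

// Builds one project entry per variation of a single flag.
// Assumption: variation index matches the position in `values`.
function variantProjects(flagKey: string, values: unknown[]): VariantProject[] {
  return values.map((value, variation) => ({
    name: `${flagKey}:${String(value)}`,
    use: { flags: { [flagKey]: { value, variation } } },
  }));
}

const pricingProjects = variantProjects('pricing-experiment', [
  'control',
  'variant-a',
  'variant-b',
]);
```

The resulting entries could be spread into the projects array in playwright.config.ts. For the flags option to type-check there, the config would likely need to declare the option type, for example through the type parameter of defineConfig.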

Deterministic Flag State In Parallel Runs

Parallelism is where flag isolation requirements become concrete. Whether you scale with workers or shards, every assumption about shared state becomes a race condition waiting to trigger.

Per-Worker Test User Isolation

testInfo.parallelIndex is the right primitive for scoping test identities per worker. It is bounded between 0 and workers - 1, and it is stable across retries. Unlike workerIndex, which starts at 1 and increments globally with every worker restart (meaning a restarted worker gets a new, higher index that can exceed workers - 1), parallelIndex represents a stable slot in the worker pool. Worker slot 2 always gets parallelIndex 2, even after a retry. This makes it safe to pre-provision a fixed pool of test identities and allocate them by slot. The distinction matters in practice: after a few worker restarts, workerIndex can exceed the size of a pre-provisioned pool, producing out-of-bounds lookups.

// fixtures/user.ts

import { test as base } from '@playwright/test';

// parallelIndex is preferred over workerIndex for pool-based identity allocation.
// workerIndex starts at 1 and increments globally on every worker restart, meaning
// a restarted worker gets a new, higher index that can exceed workers-1 and fall
// outside a pre-provisioned pool.
// parallelIndex is 0-based, stable across retries, and bounded to 0..workers-1,
// making it safe to pre-provision exactly N identities for N workers.

type Account = { email: string; userId: string };

export const test = base.extend<{}, { account: Account }>({
  account: [async ({}, use, workerInfo) => {
    const slot = workerInfo.parallelIndex;
    const email = `e2e+slot${slot}@example.com`;
    const userId = `test-user-${slot}`;
    await use({ email, userId });
  }, { scope: 'worker' }],
});

The same principle Playwright recommends for per-worker database isolation applies directly to flag identities: key each shared resource to a stable, bounded worker slot rather than to a counter that grows with every restart.

One operational question this fixture leaves open: where do the test users come from? Two patterns work. A pre-provisioned pool uses a one-time setup script to create N users matching the worker count, with predictable emails. Tests reference them by slot. This is cheap at runtime, but the pool needs to be recreated when worker count changes. On-demand provisioning creates a user on first use per worker and reuses it across that worker's tests. This handles variable worker counts but each worker pays the creation cost on first run and requires the user-creation API to be idempotent. Most teams settle on a pre-provisioned pool sized to their maximum worker count.
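The idempotency requirement for on-demand provisioning can be illustrated with an in-memory stand-in for the user backend. This is a sketch under stated assumptions: the store, the function name, and the user ID format are all hypothetical, and a real implementation would call the application's user-creation API.

```typescript
type Account = { email: string; userId: string };

// In-memory stand-in for the backend's user table.
type UserStore = Map<string, Account>;

// Get-or-create: safe to call again after a worker restart, because the
// slot's email is deterministic and an existing user is returned as-is.
function ensureUser(store: UserStore, slot: number): Account {
  const email = `e2e+slot${slot}@example.com`;
  const existing = store.get(email);
  if (existing) return existing;
  const created: Account = { email, userId: `test-user-${slot}` };
  store.set(email, created);
  return created;
}

const store: UserStore = new Map();
const firstRun = ensureUser(store, 2);
const afterRestart = ensureUser(store, 2); // same slot, restarted worker
```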

One guard is worth adding to any slot-based fixture that looks accounts up in a pre-provisioned array (TEST_ACCOUNTS below). If parallelIndex falls outside the pool, which can happen when the CI environment spins up more workers than accounts exist, the fixture should throw explicitly:

if (slot >= TEST_ACCOUNTS.length) {
  throw new Error(
    `parallelIndex ${slot} exceeds pre-provisioned account pool (${TEST_ACCOUNTS.length} accounts). ` +
    `Add more test accounts or reduce --workers.`
  );
}

Without this guard, an out-of-bounds slot produces undefined for the account and propagates silently into tests.
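Put together, the lookup and the guard might look like the following. The pool contents and the accountForSlot name are hypothetical; in practice the pool would come from the provisioning script.

```typescript
type Account = { email: string; userId: string };

// Hypothetical pre-provisioned pool, sized to the maximum worker count.
const TEST_ACCOUNTS: Account[] = [
  { email: 'e2e+slot0@example.com', userId: 'test-user-0' },
  { email: 'e2e+slot1@example.com', userId: 'test-user-1' },
];

// Returns the account for a worker slot, or throws if the slot has no account.
function accountForSlot(pool: Account[], slot: number): Account {
  const account = pool[slot];
  if (account === undefined) {
    throw new Error(
      `parallelIndex ${slot} exceeds pre-provisioned account pool (${pool.length} accounts). ` +
      `Add more test accounts or reduce --workers.`
    );
  }
  return account;
}
```

A worker-scoped fixture could then call accountForSlot(TEST_ACCOUNTS, workerInfo.parallelIndex) and pass the result to use().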

Avoiding Race Conditions In globalSetup

globalSetup runs once before any worker starts. It has no access to fixtures, no access to the page object, and its only channels back to tests are process.env and the filesystem. It is appropriate for one-time provisioning work, not for per-test state. The trap is using globalSetup to mutate flag state via the management API and expecting that mutation to have fully propagated by the time the first test runs.

Flag propagation through a live SDK is not instantaneous. The management API write can return 200 while the SSE stream to connected clients is still mid-delivery. Polling-based SDKs may not refresh for another 30 seconds. Adding a setTimeout in globalSetup reduces the chance of tests starting before propagation completes, but does not eliminate it. The propagation window is non-deterministic.

The safe use of globalSetup for flag-related work is the Flag Snapshot Pattern. Call the provider's server-side evaluation API once, write the full keyed-by-flag-key response to a JSON file on disk, and then have every worker's fixture serve that file via page.route(). The snapshot is immutable for the duration of the run. Mid-run flag changes in the dashboard have no effect. The snapshot file is also attachable to the test report as an artifact, giving you an exact record of flag state for any given run.

For setup work that needs to appear in traces and the HTML report, the modern approach is a setup project with dependencies rather than globalSetup:

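// playwright.config.ts

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    // Runs first and writes the snapshot file
    { name: 'flag-snapshot', testMatch: /flag\.setup\.ts/ },
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
      dependencies: ['flag-snapshot'],
    },
  ],
});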

The flag.setup.ts file the config references captures the snapshot:

import { test as setup } from '@playwright/test';
import { writeFileSync } from 'fs';
import { resolve } from 'path';

setup('capture flag snapshot', async () => {
  // This runs in Node, so it hits the server-side SDK evaluation endpoint
  // (sdk.launchdarkly.com), not the client-side endpoint (clientsdk.launchdarkly.com)
  // that the browser SDK uses. The server-side endpoint requires a server-side SDK key,
  // not a client-side ID.
  // Use the evaluation API, not the management API (/api/v2/flags). The management API
  // returns flag configuration (targeting rules, rollout percentages, environment settings),
  // not evaluation results. The fixture's toLDPayload helper expects { value, variation,
  // version, flagVersion }. Those fields do not exist in management API responses.
  const context = Buffer.from(JSON.stringify({ kind: 'user', key: 'snapshot-context' })).toString('base64url');
  const response = await fetch(
    `https://sdk.launchdarkly.com/sdk/evalx/${process.env.LD_ENV_KEY}/contexts/${context}`,
    { headers: { Authorization: process.env.LD_SDK_KEY! } }
  );
  // Fail the setup project rather than writing an error body as the snapshot
  if (!response.ok) {
    throw new Error(`Flag snapshot request failed: ${response.status} ${response.statusText}`);
  }
  const flags = await response.json();
  writeFileSync(resolve('.flag-snapshot.json'), JSON.stringify(flags, null, 2));
});

The worker fixture then reads .flag-snapshot.json instead of an inline flag map, and serves it via page.route().
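Reading the snapshot is a good place to validate it, so that a truncated or malformed file fails fast instead of producing silent fallbacks. The parser below is a minimal sketch with a hypothetical name; it assumes the snapshot is the keyed-by-flag-key object described above and keeps only the fields the fixture's flag map uses.

```typescript
type FlagValue = { value: unknown; variation: number; version?: number };
type FlagMap = Record<string, FlagValue>;

// Parses snapshot JSON text into the flag map shape the fixture expects.
function parseFlagSnapshot(jsonText: string): FlagMap {
  const raw: unknown = JSON.parse(jsonText);
  if (typeof raw !== 'object' || raw === null || Array.isArray(raw)) {
    throw new Error('Flag snapshot must be a JSON object keyed by flag key');
  }
  const flags: FlagMap = {};
  for (const [key, entry] of Object.entries(raw as Record<string, unknown>)) {
    const e = entry as Partial<FlagValue> | null;
    // Each entry needs at least a resolved value and a numeric variation.
    if (!e || typeof e.variation !== 'number' || !('value' in e)) {
      throw new Error(`Flag "${key}" in snapshot is missing value or variation`);
    }
    flags[key] = { value: e.value, variation: e.variation, version: e.version };
  }
  return flags;
}
```

A fixture could call parseFlagSnapshot(readFileSync('.flag-snapshot.json', 'utf8')) once and use the result as the default value of the flags option.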

The setup project's output appears in the trace viewer alongside the tests that depended on it, which makes debugging propagation issues substantially easier. The fixture architecture above assumes the interception is in place and working. Getting it working correctly requires knowing the specific endpoints and response shapes each provider expects.

Mocking Flag SDKs at the Network Layer

Before reaching for network interception, check whether your application supports bootstrap injection. When it does, page.addInitScript() is cleaner than page.route() because there is no SSE connection to manage and no per-version SDK URL drift to track. The endpoint-specific interception in the following section is the right answer when bootstrap is not available.

Intercepting SDK Endpoints

The LaunchDarkly JavaScript SDK talks to three distinct hosts depending on what it is doing. Evaluation requests go to clientsdk.launchdarkly.com via GET /sdk/evalx/<envId>/contexts/<base64Context> for current SDKs with multi-context support, or /sdk/evalx/<envId>/users/<base64User> for older SDKs that predate the context model. The streaming connection goes to clientstream.launchdarkly.com via GET /eval/<envId>/<base64Context>. Analytics events go to events.launchdarkly.com.

Intercept all three. If the streaming endpoint is left unmocked, the SDK either connects to the live stream and receives real flag updates or, where CI blocks outbound traffic, treats it as a connection failure and retries indefinitely, generating console noise and potentially causing timeouts in tests that wait on SDK initialization.

The evaluation response is a JSON object keyed by flag key. Each flag entry requires value, variation, version, flagVersion, and trackEvents. Omitting any of these does not throw. The SDK falls back silently, which is why malformed mocks produce false passes rather than obvious errors.
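To check that nothing slips past the intercepts, it can be useful to label each request by channel. The classifier below is a sketch that reuses the URL patterns from the fixture; the function name and the idea of logging unmocked traffic with it are illustrative assumptions.

```typescript
type LDChannel = 'evaluation' | 'stream' | 'events' | 'other';

// Labels a request URL with the LaunchDarkly channel it belongs to.
function classifyLDRequest(url: string): LDChannel {
  if (/clientsdk\.launchdarkly\.com\/sdk\/evalx\/.*\/(contexts|users)/.test(url)) {
    return 'evaluation';
  }
  if (/clientstream\.launchdarkly\.com\/eval/.test(url)) return 'stream';
  if (/events\.launchdarkly\.com/.test(url)) return 'events';
  return 'other';
}
```

A request listener in the fixture could pass each request URL through this function and fail the test if a LaunchDarkly channel shows up without a corresponding mocked response.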

Flagsmith's client-side SDK makes a single request to /api/v1/identities/ when flagsmith.identify() is called, and then evaluates all subsequent flag checks from the in-memory response. Intercept that one endpoint and you control all flag evaluations for the session.
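A mock for that endpoint can be generated from a compact declaration. The sketch below builds entries with the feature, feature_state_value, and enabled fields. The 'STANDARD' type string and the synthetic numeric ids are assumptions, and the exact response envelope should be checked against the Flagsmith SDK version in use.

```typescript
type FlagsmithValue = string | number | boolean | null;
type FlagsmithFlag = {
  feature: { name: string; type: string; id: number };
  feature_state_value: FlagsmithValue;
  enabled: boolean;
};

// Hypothetical builder for mocked Flagsmith flag entries.
// Assumptions: 'STANDARD' as the feature type, ids numbered from 1.
function toFlagsmithFlags(
  flags: Record<string, { enabled: boolean; value?: FlagsmithValue }>
): FlagsmithFlag[] {
  return Object.entries(flags).map(([name, flag], index) => ({
    feature: { name, type: 'STANDARD', id: index + 1 },
    feature_state_value: flag.value ?? null,
    enabled: flag.enabled,
  }));
}

const flagsmithMock = toFlagsmithFlags({
  'new-checkout': { enabled: true, value: 'v2' },
  'express-pay': { enabled: false },
});
```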

Split's browser SDK is more involved. It fetches segment membership from /api/memberships/<userId>, split definitions from /api/splitChanges, and opens a streaming connection to streaming.split.io. Mock all three: the membership endpoint, the split definitions endpoint, and the streaming connection. If the membership or splitChanges response is missing, getTreatment() returns the string "control" without throwing, which is the Split equivalent of LaunchDarkly's silent fallback.
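The fallback rule is simple enough to mirror in a test helper, which makes the expected treatment explicit when writing assertions. This is a toy model of the behavior described above, not the Split SDK itself, and the names are hypothetical.

```typescript
// Maps split names to the treatment the mock declares for them.
type TreatmentMock = Record<string, string>;

// Mirrors the SDK's fallback: a split with no mocked definition yields "control".
function expectedTreatment(mock: TreatmentMock, splitName: string): string {
  return mock[splitName] ?? 'control';
}

const splitMock: TreatmentMock = { 'new-checkout': 'on' };
const known = expectedTreatment(splitMock, 'new-checkout');
const unknown = expectedTreatment(splitMock, 'express-pay');
```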

Handling SDK Initialization Race Conditions

Streaming behavior depends on which SDK the application uses. The vanilla JavaScript SDK (launchdarkly-js-client-sdk) does not open a streaming connection automatically. It only streams if the application subscribes to change or change:flag-key events, or if streaming: true is passed explicitly to LDClient.initialize(). The React Web SDK (launchdarkly-react-client-sdk) behaves differently: it subscribes to individual flag change events internally whenever you use variation hooks, which automatically opens a streaming connection. This means most React apps using the SDK will have streaming active by default. You can disable it by passing streaming: false in the SDK options, but the default behavior is to stream.

Check the changelog for your installed version before assuming the streaming intercept is required or optional. When in doubt, include it: an unnecessary intercept is harmless, while a missing one produces silent stale flag state. For applications that use only the vanilla JS SDK and register no change listeners, intercepting the single polling endpoint is enough, and the streaming intercept in the fixture above is a precaution rather than a requirement.

If the app does subscribe to change events, you can suppress streaming at the SDK level by passing streaming: false in the client options, which reduces the fixture to a single route intercept. With streaming disabled, the SDK fetches once on initialization and does not open a persistent SSE connection. You then only need to intercept the single polling endpoint, and there is no long-lived response body to keep alive. If disabling streaming is not an option because the application under test relies on it for real-time UI updates, serve a synthetic SSE response that includes a single put event followed by periodic keep-alive comments.
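The synthetic SSE body described above can be sketched as a pure string builder. This is an illustration under assumptions: sseEvent, sseComment, and buildStreamBody are hypothetical helper names, and the put event is assumed to carry the same flag map as the evaluation response. Because route.fulfill() takes a complete body, the keep-alive comments here are emitted up front as a fixed run of comment lines.

```typescript
// Format one Server-Sent Event: an `event:` line, a `data:` line, then a blank line.
// JSON.stringify without indentation yields a single line, so one data line suffices.
function sseEvent(name: string, data: unknown): string {
  return `event: ${name}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Lines that start with ":" are SSE comments, which clients ignore.
function sseComment(text = 'keep-alive'): string {
  return `:${text}\n\n`;
}

// Build the full body: a single put event followed by a fixed number of comments.
function buildStreamBody(flags: Record<string, unknown>, keepAlives = 3): string {
  return sseEvent('put', flags) + sseComment().repeat(keepAlives);
}
```

In a route handler for the streaming host, the string can be served with route.fulfill({ status: 200, contentType: 'text/event-stream', body: buildStreamBody(payload) }). It is worth confirming against your SDK version how the client behaves once the fulfilled body ends.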

The SDK will stay initialized, and because page.route() handles the entire response, no retry loop can occur.

For apps that register a Service Worker, flag requests may never reach your page.route() handlers. When a Service Worker has a fetch handler, it intercepts the request and can serve a response from its own cache or logic. The request never hits the external network, so Playwright's route handler never fires. Without serviceWorkers: 'block' set in the Playwright configuration, a Workbox caching strategy can serve a cached real flag response, bypassing your mock entirely. Setting serviceWorkers: 'block' disables Service Workers for the test session, ensuring requests reach your route handlers.

Generic REST Flag APIs

For homegrown toggle systems that expose a simple REST endpoint, route.fulfill() with a JSON body is sufficient. A common mistake is returning a shape that does not match what the application code actually reads. If the application checks response.flags['new-checkout'].enabled and the mock returns response.flags['new-checkout'] = true, the property access silently returns undefined and the feature appears off. Capture a real network response once, sanitize it, and use that as your mock template.

For long-lived test suites, consider validating mock payloads against a JSON schema or the response type from your API client so that backend shape changes break tests at construction time rather than as silent fallbacks at runtime.
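A hand-written guard is one lightweight way to get that construction-time failure without pulling in a schema library. The sketch below assumes the response shape implied by the response.flags['new-checkout'].enabled access above; FlagResponse and assertFlagResponse are hypothetical names, and a real API will likely carry more fields.

```typescript
// Assumed response shape: { flags: { [key]: { enabled: boolean } } }.
interface FlagResponse {
  flags: Record<string, { enabled: boolean }>;
}

// Throw when a mock does not match the shape the application reads.
function assertFlagResponse(candidate: unknown): FlagResponse {
  if (typeof candidate !== 'object' || candidate === null) {
    throw new Error('mock must be an object');
  }
  const flags = (candidate as { flags?: unknown }).flags;
  if (typeof flags !== 'object' || flags === null) {
    throw new Error('mock.flags must be an object');
  }
  for (const [key, entry] of Object.entries(flags as Record<string, unknown>)) {
    // A bare `true` has no `enabled` property, so this catches the mistake above.
    const enabled = (entry as { enabled?: unknown } | null)?.enabled;
    if (typeof enabled !== 'boolean') {
      throw new Error(`mock.flags['${key}'].enabled must be a boolean`);
    }
  }
  return candidate as FlagResponse;
}

// A well-formed mock passes through unchanged.
const goodMock = assertFlagResponse({ flags: { 'new-checkout': { enabled: true } } });

// The malformed mock from the paragraph above is rejected before any test runs.
let malformedRejected = false;
try {
  assertFlagResponse({ flags: { 'new-checkout': true } });
} catch {
  malformedRejected = true;
}
```

Calling the guard where the mock is defined, rather than inside the route handler, keeps the failure close to the mistake instead of surfacing it as a feature that silently appears off.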

Fixtures and test.use() handle flag variants at the describe and file level. When the goal is running the entire suite against multiple flag configurations simultaneously, Projects are the right tool.

Playwright Projects As Flag Variant Test Suites

The most scalable way to test across multiple flag configurations is to model each configuration as a Playwright project. A project is a named group of tests with its own use configuration, and project-level use merges with the root use rather than replacing it.

// playwright.config.ts
import { defineConfig, devices } from '@playwright/test';
import type { FlagMap } from './fixtures/flags';

export default defineConfig<{ flags: FlagMap }>({
  testDir: './tests',
  fullyParallel: true,
  use: {
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
    trace: 'on-first-retry',
    serviceWorkers: 'block',
    flags: {},
  },
  projects: [
    {
      name: 'flags-off',
      use: { ...devices['Desktop Chrome'], flags: {} },
    },
    {
      name: 'flags-variant-a',
      use: {
        ...devices['Desktop Chrome'],
        flags: {
          'new-checkout': { value: true,  variation: 0 },
          'express-pay':  { value: false, variation: 1 },
        },
      },
    },
    {
      name: 'flags-variant-b',
      use: {
        ...devices['Desktop Chrome'],
        flags: {
          'new-checkout': { value: true, variation: 0 },
          'express-pay':  { value: true, variation: 0 },
        },
      },
    },
  ],
});

Run a single variant with npx playwright test --project=flags-variant-a, or omit --project to run every variant; Playwright runs them in parallel by default. Because each project is a separate named entity in the HTML report and trace viewer, failures are immediately attributable to a specific flag combination, with no manual log-diffing required. This pattern removes conditional logic from test files entirely. The same checkout.spec.ts runs three times, once per project. Each run sees a different flag payload, and all three results appear in the same report under different project labels.

One scaling consideration: each project runs the full test directory against its flag config. A 200-test suite with three variant projects executes 600 test runs. Sharding spreads those runs across CI machines but does not reduce the total. For large suites, scope each variant project to the tests that actually depend on the relevant flags using testMatch or testIgnore. Scaling Playwright in CI already costs machine time per shard, and unconstrained variant projects compound that cost. Running every test against every variant is rarely the right default.
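Scoping can be kept consistent by building variant entries through a small helper that requires a testMatch. The following is a sketch with local stand-in types so it runs on its own: FlagMap here is an assumed shape, VariantProject models only the project fields used, and the file patterns are illustrative.

```typescript
// Assumed flag map shape, mirroring the entries in the config above.
type FlagMap = Record<string, { value: unknown; variation: number }>;

// Only the parts of a Playwright project entry this sketch touches.
interface VariantProject {
  name: string;
  testMatch: string | RegExp | Array<string | RegExp>;
  use: { flags: FlagMap };
}

// Requiring testMatch makes an unscoped variant project impossible to declare.
function variantProject(
  name: string,
  flags: FlagMap,
  testMatch: VariantProject['testMatch'],
): VariantProject {
  return { name, testMatch, use: { flags } };
}

const variantProjects: VariantProject[] = [
  variantProject(
    'flags-variant-a',
    { 'new-checkout': { value: true, variation: 0 } },
    /checkout.*\.spec\.ts$/, // illustrative pattern
  ),
  variantProject(
    'flags-variant-b',
    { 'express-pay': { value: true, variation: 0 } },
    /express-pay.*\.spec\.ts$/, // illustrative pattern
  ),
];
```

In a real playwright.config.ts these entries would be spread into the projects array, with the device preset merged into each use block as in the earlier config.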

The fixture architecture and project structure cover how to control flag state during a run. Equally important is how you manage flag state across the lifecycle of a feature, from rollout to cleanup.

Flag Lifecycle Management

The starting point is making the relationship between a test and the flag key it depends on explicit and queryable.

Tagging Tests By Flag

Tagging tests by the flag key they depend on makes it possible to run only the tests affected by a flag change. Playwright's tag convention uses @ in the test title:

test('renders express-pay button @flag-express-pay', async ({ page }) => { ... });

Run all tests for a specific flag key with npx playwright test --grep @flag-express-pay. When a flag is removed from the codebase, the tag gives you an immediate list of tests to clean up.

Asserting Against Declared State

Tests should assert that the application behaves according to the flag state the fixture declared, not according to what the live SDK reports. The fixture controls the flag state. The test's job is to verify that the application responds to that state correctly. If the test reads the flag value from the SDK to decide what to assert, it has reintroduced the ambient-state dependency the fixture was meant to eliminate.

For teams making longer-term flag architecture decisions: OpenFeature is a CNCF-hosted standard that defines a provider-agnostic interface for flag evaluation. LaunchDarkly, Flagsmith, ConfigCat, and others publish OpenFeature providers. Tests written against the OpenFeature interface continue to work after a provider migration because the evaluation API is standardized.

Important nuance: the OpenFeature abstraction is at the evaluation API level, not the network level. The browser SDK still makes provider-specific HTTP requests under the hood, which means the network-interception techniques in this article still apply per-provider. OpenFeature does not give you a single mock point that bypasses those calls. For testing without any network layer at all, the OpenFeature SDK ships a TypedInMemoryProvider (in @openfeature/web-sdk) that resolves flags from an in-process map. Configuring your application to accept an injected provider and swapping in TypedInMemoryProvider during tests is the cleanest OpenFeature-native approach and sidesteps the page.route() layer entirely.

For flags whose evaluation depends on date or time, such as rollout windows or scheduled releases, Playwright's page.clock API installs a synthetic clock the application sees as the current time:

await page.clock.install({ time: new Date('2026-06-01T00:00:00Z') });
await page.goto('/feature-page');

Combined with the fixture pattern, this lets a single test evaluate the same flag under different temporal conditions. Note that page.clock only affects the browser's clock. Server-side flag evaluation uses the server's clock, so this pattern works only for client-side or bootstrap-injected flag state.

Cleanup For API-Level Mutations

When API-level flag mutation is unavoidable, a fixture with try/finally is the only safe pattern:

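import { test as base } from '@playwright/test';
// getFlag and patchFlag are assumed helpers that wrap your provider's
// management API; the module path is illustrative.
import { getFlag, patchFlag } from './flag-api';

export const test = base.extend<{ setFlag: (key: string, enabled: boolean) => Promise<void> }>({
  setFlag: async ({}, use) => {
    // Every mutation is recorded so teardown can undo it.
    const touched: { key: string; original: boolean }[] = [];

    const mutate = async (key: string, enabled: boolean) => {
      // Read the current value first so it can be restored later.
      const original = await getFlag(key);
      touched.push({ key, original });
      await patchFlag(key, enabled);
    };

    try {
      await use(mutate);
    } finally {
      // Restore in reverse order so the earliest recorded value wins.
      for (const { key, original } of touched.reverse()) {
        await patchFlag(key, original).catch(err =>
          console.error(`[cleanup] failed to restore ${key}:`, err)
        );
      }
    }
  },
});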

One race condition this pattern does not address: if multiple workers mutate the same flag key against a live backend in parallel, their original-value reads can interleave such that both workers record the same pre-mutation value as their restoration target, meaning one worker's cleanup overwrites the other's.

The mitigation is either to namespace flag mutations per worker using parallelIndex, or to avoid API-level mutation entirely in favor of network interception, which is what the rest of this article recommends.
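The race is easier to see in a small in-memory simulation. The sketch below is not tied to any provider: the Map stands in for a live backend, slots are keyed by flag and test identity, and workerIdentity is a hypothetical naming scheme. It replays one possible interleaving on a shared identity, then the same steps with identities derived from parallelIndex.

```typescript
// In-memory stand-in for a live flag backend, keyed by "flagKey|identity".
const backend = new Map<string, boolean>();

interface Restore {
  slot: string;
  original: boolean;
}

const slotFor = (flagKey: string, identity: string): string => `${flagKey}|${identity}`;

// Same steps as the fixture's mutate: read the current value, record it, overwrite it.
function mutate(flagKey: string, identity: string, enabled: boolean): Restore {
  const slot = slotFor(flagKey, identity);
  const original = backend.get(slot) ?? false;
  backend.set(slot, enabled);
  return { slot, original };
}

function restore(r: Restore): void {
  backend.set(r.slot, r.original);
}

// Shared identity: worker B reads after worker A has already mutated.
const a = mutate('new-checkout', 'shared-user', true); // A records original = false
const b = mutate('new-checkout', 'shared-user', true); // B records original = true
restore(a); // A finishes first and restores false while B is still running
restore(b); // B restores true, so the flag is left switched on
const leaked = backend.get(slotFor('new-checkout', 'shared-user'));

// Per-worker identity (hypothetical naming): no shared slot, so nothing leaks.
const workerIdentity = (parallelIndex: number): string => `e2e-worker-${parallelIndex}`;
const c = mutate('new-checkout', workerIdentity(0), true);
const d = mutate('new-checkout', workerIdentity(1), true);
restore(c);
restore(d);
const isolated = [0, 1].map(i => backend.get(slotFor('new-checkout', workerIdentity(i))));
```

In a fixture, the index would come from testInfo.parallelIndex. With the shared identity the flag ends up on even though it started off; with per-worker identities both slots return to their starting value.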

The try/finally block runs whether the test passes, fails, or times out. The .catch in the finally block prevents a restore failure from masking the original test error. Without this pattern, a single test failure leaves a flag in a modified state that can break every subsequent test that reads it. The API keys that enable programmatic flag mutation are secrets, and they require the same discipline as any other CI credential.

Securing Flag API Credentials

Flag management API keys are secrets. They should live in CI environment variables and never appear in test output, trace files, or logs. The non-obvious risk is that headers constructed dynamically from environment variables can appear in Playwright traces. If your fixture uses route.continue() to pass through requests with a dynamically constructed Authorization header, Playwright captures that header in the trace, including the actual secret value. Set trace: 'on-first-retry' rather than 'on' to reduce the surface area, and explicitly redact the Authorization header in any trace post-processing. Running Playwright tests without the pain covers the broader CI secrets configuration in detail.
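The redaction step itself can be a small pure function. This sketch covers only that step and assumes headers are available as name/value pairs; how they are read from and written back to a trace or log depends on your post-processing pipeline and is not shown. The list of sensitive names is an assumption to adjust for your provider.

```typescript
interface Header {
  name: string;
  value: string;
}

// Header names treated as secrets (assumed list; extend as needed).
const SENSITIVE_HEADERS = new Set(['authorization', 'cookie', 'x-api-key']);

// Return a copy with sensitive values replaced; header names match case-insensitively.
function redactHeaders(headers: Header[]): Header[] {
  return headers.map(h =>
    SENSITIVE_HEADERS.has(h.name.toLowerCase()) ? { ...h, value: '[REDACTED]' } : h,
  );
}
```

The same function can be applied before logging request details from a fixture, so the secret never reaches test output in the first place.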

Integrating With Currents For Flag Variant Visibility

Declaring flag state and isolating it per test solves the determinism problem. It does not solve the diagnostic problem: when a flag-related failure occurs in CI, how do you trace it back to a specific flag state, worker, or run without manually diffing logs? This is where Currents becomes the second half of the architecture.

Currents captures traces, screenshots, step timing, and test metadata across runs and aggregates them into dashboards that surface flakiness trends, worker-specific failures, and retry patterns. For flag-variant testing, the most useful Currents capability is project-level tagging. Each flag variant project in playwright.config.ts can carry tags via the metadata.pwc.tags field:

{
  name: 'flags-variant-b',
  metadata: { pwc: { tags: ['flags-variant-b', 'express-pay-on'] } },
  use: { ... },
}

These tags appear in the Currents dashboard as filter axes. You can pivot a flakiness chart by variant, compare failure rates between flags-off and flags-variant-b, and immediately see whether a particular variant started failing after a specific commit.

At the run level, CURRENTS_TAG=ld-snapshot-${GIT_SHA} correlates the flag snapshot version with the run, so when a failure is flag-induced, you can trace it to the exact snapshot that was active. Currents also distinguishes between tests that fail consistently and tests that fail on the first attempt but pass on retry. Flag-induced failures almost always fall into that second, retry-only category, because the flag state is non-deterministic rather than systematically wrong. That pattern is a reliable signal that the test is reading the ambient state rather than the declared state, and Currents surfaces it without requiring you to manually diff the run logs.

Final Considerations

The mental shift that makes flag testing reliable at scale is treating flag state as test infrastructure, not environmental context. Infrastructure is provisioned, versioned, and owned by the test suite. Environmental context is ambient, shared, and outside the suite's control. Every approach in this article, from page.route() fixtures to the Flag Snapshot Pattern to project-based variant matrices, is an implementation of that shift. Flag issues are also observability problems, and not just setup problems.

A team can have a technically correct mocking strategy and still accumulate flag-related flakiness because it lacks visibility into which variant a failure came from, or whether a retry-only failure correlates with a recent flag rollout. Declaring flag state explicitly solves the determinism problem. Having cross-run visibility via a tool like Currents solves the diagnostic problem. Both are necessary at scale. Teams that declare flag state, isolate it per test, and monitor behavior across runs end up with test suites where every failure is a real signal and every pass is a genuine guarantee. That outcome is achievable without custom tooling or outsized CI investment. It requires getting the abstractions right: fixtures over hooks, declared state over ambient state, and network interception over live SDK connections.

The patterns in this article are composable. A team does not need all of them at once. The order in which they become necessary usually follows the growth of the suite: page.route() fixtures first, then per-worker identity scoping, then Projects for variant matrices, then a cross-run observability layer to close the diagnostic loop. The first three are infrastructure decisions the team makes regardless of tooling. The fourth is a tooling decision: build the observability layer in-house or use something off the shelf like Currents. For teams who have hit the maintenance cost of a homegrown reporter, the off-the-shelf answer tends to win.


Trademarks and logos mentioned in this text belong to their respective owners.