Currents Team

May 04, 2026

Designing Playwright Tests That Survive UI Refactors

Your UI refactor didn't break the app, it broke your tests. Learn how semantic Playwright test design decouples your suite from implementation details for good.

A design system migration ships with the interface looking correct, components rendering as expected, and user flows behaving as intended. Then CI runs and three dozen Playwright tests go red.

None of these failures reveals a broken feature. They expose a costlier problem: a test suite that was quietly coupled to implementation details all along. The application behaves identically, but CSS class names and DOM nodes have changed. The tests cannot tell the difference because they were never observing user behavior in the first place.

Selectors like div.btn-primary-new-style or nth-child(3) test how the UI was built at a specific moment, not what users actually experience. When that moment changes, the tests break, and every hour spent chasing selectors is an hour not spent on meaningful coverage. Selector choice alone does not resolve the underlying problem.

The fix is designing tests at an abstraction level that stays stable when implementation details change. This article covers the patterns that make a test suite resilient through design system migrations, component library upgrades, and UI refactors.

Understanding why tests break during refactors

Refactors that preserve application behavior but break test suites are not just a selector problem. They are also a design problem. The two layers need different fixes: selector choice (which locator strategy you use) and abstraction layer design (how interaction logic is organized across your test suite). Getting the fix right starts with knowing which type of failure is actually happening. Most engineers treat these failures as isolated incidents and apply quick workarounds, so the pattern repeats because the underlying cause is never resolved.

Google has reported that 84% of the pass-to-fail transitions in its test infrastructure involve a flaky test. While flakiness introduces randomness into test outcomes, it is only one category of failure. Implementation-coupled tests add a separate layer to that cost. They fail consistently after updates to the UI structure, producing failures that look like product issues but trace back to selector design.

Many of these failures result not from broken features, but from tests monitoring the wrong behaviors. What follows is a breakdown of the five most common coupling patterns that make Playwright suites fragile under refactors. Each one shows a clear failure signature and carries a measurable cost.

Selector coupling to visual structure

Structural selectors encode the DOM hierarchy at a specific point in time. A selector like div > section > form > input describes how the HTML was arranged when the test was written, not what the user actually sees or does.

Refactors often reorganize that arrangement, with forms moving into modals, components gaining new layout containers, or child components being extracted for reuse. The underlying behavior remains stable, but the selector no longer reflects the current structure. The test fails even though the product works as expected, because the path the test relied on has changed.

The DOM changes frequently, and tests that depend on structure will break when the page is reorganized. What remains stable is what users interact with: roles, labels, and accessible names. These convey an element's purpose rather than its position in the HTML.
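To make the contrast concrete, here is a minimal sketch of the same field located two ways. `PageLike` is a hypothetical stand-in that covers only the two Playwright methods used, so the snippet runs without a browser; in a real test you would pass Playwright's `page` instead. The label text is an assumed example.

```typescript
// Hypothetical stand-in for the two Playwright `page` methods used below.
interface PageLike {
  locator(selector: string): unknown;
  getByLabel(text: string): unknown;
}

// Brittle: encodes the DOM hierarchy as it existed when the test was written.
function emailFieldByStructure(page: PageLike): unknown {
  return page.locator('div > section > form > input');
}

// Resilient: encodes what the user sees, a field labelled "Email address".
function emailFieldByLabel(page: PageLike): unknown {
  return page.getByLabel('Email address');
}
```

Moving the form into a modal or adding a wrapper element invalidates the first locator, while the second keeps resolving as long as the label association survives.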

Selector coupling to styling artifacts

CSS class names generated by component libraries are build artifacts, for example .MuiButton-containedPrimary, .css-1x2y3z, or ._button_a3f2k_1. These strings are outputs of the build process, not stable identifiers. When a library version is bumped or a build configuration changes, class names can be rewritten entirely. The behavior remains unchanged, but the selector no longer matches anything.

Tests that rely on auto-generated CSS classes or XPath expressions tied to DOM structure break whenever developers refactor the UI. In component-library-heavy codebases, the bar for breakage is low. The Material UI upgrade from v4 to v5 replaced the default styling engine (JSS) with Emotion, which changed the generated class names that class-based selectors depended on. A CSS Modules configuration change can rotate the hash suffix. In both cases, the interface remains the same from the user's perspective.

A Tailwind migration makes this pattern more pronounced: converting a component from CSS-in-JS replaces one or two class names with a dozen utility classes. Every one of them differs from what was there before, and any test using class-based selection against that component fails. The behavior remains unchanged, but the class list no longer matches anything the test expects.

Coupling to component internals

Some tests reach into component internals: internal element IDs generated by the library, assumptions about specific HTML element types, or XPath expressions that target a fixed DOM path inside a component.

For example, a test targeting input[type=text] will fail as soon as a component library replaces its native input with a custom implementation. This pattern shows up repeatedly when adopting headless UI libraries and design systems with more advanced accessibility features. While users see the same behavior, the underlying HTML element changes, causing the test to fail.

Component libraries often introduce wrapper nodes for layout or accessibility, breaking deeply chained selectors. A test that targets the library's internal structure ends up verifying the library's implementation decisions rather than the application's behavior. These internal details can change in any minor version, causing the test to fail even if the application behaves correctly.

Playwright's locators pierce open shadow DOM by default, so getByRole, getByText, and the other semantic locators work inside shadow roots without any extra configuration. The fragility comes from using XPath inside shadow DOM (XPath does not pierce shadow roots) or targeting closed-mode shadow roots, which Playwright does not support. Semantic locators remain stable across shadow DOM boundaries; structural ones do not.

This pattern is most damaging in authentication flows, where engineers rely on internal identifiers for password inputs, OTP fields, and modal containers because the component library does not expose clean semantic attributes. These tests break during design system upgrades, even when authentication works perfectly, because the component's internal structure has shifted.

Implicit coupling via position and order

Index-based selectors and :nth-child expressions rely on elements appearing in a fixed order, which breaks the moment a refactor reorganizes the layout.

When a navigation item is added above an existing one, every index that follows shifts. A form field reordered for UX reasons breaks any test that targets the third field by position, and a data table updated to highlight more relevant columns invalidates selectors tied to column index. None of these changes affect how the application behaves, yet they cause position-based tests to fail.

Avoiding positional selectors like nth(3) is well-established practice. The problem goes beyond fragility: these selectors fail without explaining what behavior they were meant to verify. The failure message reflects only a position change, giving the debugging engineer little insight into what actually went wrong.

Text content coupling

Tests that select by exact text content tie test behavior to content decisions. A copy update, an i18n migration, or a content management update breaks the selector, even though the application continues to work as expected.

Text selectors are not inherently flawed. In Playwright, getByText() is a first-class locator designed to reflect how users interact with the interface. When the text itself is part of the user-facing contract, such as headings, legal disclaimers, or visible status messages, asserting on text is appropriate and necessary.

The problem emerges when text is used as a stable identifier for elements whose content is expected to change. In those cases, the test is no longer validating behavior, but mirroring implementation details that evolve independently of functionality. Text-coupled tests break when products introduce new locales. The application remains correct across languages, but the tests no longer match the updated content.

Partial text matching improves resilience but does not remove the coupling. A test using getByText(/sign in/i) survives capitalization changes but still breaks when the product team renames the action from "Sign in" to "Log in," even though nothing about the underlying behavior has changed.
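A quick way to see where partial matching stops helping is to run the pattern against a few variants of the copy. This is a plain regex check, not a Playwright call, and the label variants are made-up examples.

```typescript
// The pattern a test might pass to getByText() or as a role `name`.
const signInPattern = /sign in/i;

// Hypothetical button copy before and after edits by the product team.
const labels = ['Sign in', 'SIGN IN', 'Sign in to continue', 'Log in'];

// Which labels would the pattern still match?
const stillMatched = labels.filter((label) => signInPattern.test(label));

// The first three survive; the renamed "Log in" does not.
console.log(stillMatched);
```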

A survey of 335 professional software developers and testers across different domains found that the primary cost of unreliable tests is not the computational overhead of re-runs, but the gradual loss of confidence in test results. Text-based selectors accelerate that problem by producing failures that engineers learn to ignore. Copy changes, the test flags it, someone updates the string. Over time, failures lose meaning, and real issues become harder to spot.

The selector hierarchy: what to use and when

Not all selectors age equally. Some survive a design system migration without a single change, while others fail the moment a developer renames a utility class or upgrades a component library. What separates them is what each selector is anchored to.

User-facing attributes and explicit contracts produce more stable tests because they tie directly to what the application intentionally exposes. This creates a clear selector hierarchy. Where a selector sits in that hierarchy determines whether a suite needs constant maintenance or mostly holds up on its own.

Tier 1: ARIA role and accessible name

page.getByRole('button', { name: 'Add to cart' }) sits at the top because it queries based on semantic role and accessible name, not DOM structure alone. Role-based selectors survive layout changes and class updates because well-built design systems often preserve semantic roles by default. A refactor that converts a <div role="button"> to a native <button> does not break this selector, and neither does a Tailwind migration that rewrites every class name on the component.

There is a second benefit. When a change breaks a getByRole locator, it has usually broken the component's accessibility contract too, so the locator doubles as a lightweight accessibility check. A login button that no longer exposes its role to assistive technology represents a functional regression, even if the UI still appears correct.

For elements with dynamic accessible names such as order numbers or user-specific text, regex matchers handle the variation cleanly: getByRole('heading', { name: /order #\d+/i }).

Tier 2: Label and placeholder

page.getByLabel('Email address') works by binding to the form label association, using for/id, aria-labelledby, and aria-label to find the right element. Label-based locators remain stable across input type changes, component library updates, and layout refactors, as long as the label text and its association stay intact.

getByPlaceholder serves as a fallback for unlabelled inputs, but introduces more fragility. Placeholder text changes frequently for UX reasons and carries less semantic weight than a label in the accessibility tree.
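The first two tiers often appear together in a form flow. The sketch below shows one possible shape, typed against small hypothetical interfaces (`LoginPageLike`, `FieldLike`, `ButtonLike`) so it runs without a browser; in a real suite the argument would be Playwright's `page`. The label and button names are assumptions about the form under test.

```typescript
// Hypothetical minimal shapes of the Playwright objects used below.
interface FieldLike { fill(value: string): Promise<void>; }
interface ButtonLike { click(): Promise<void>; }
interface LoginPageLike {
  getByLabel(text: string): FieldLike;
  getByRole(role: string, options: { name: string | RegExp }): ButtonLike;
}

async function signIn(page: LoginPageLike, email: string, password: string): Promise<void> {
  // Tier 2: bind to the label association, not the input's type or class.
  await page.getByLabel('Email address').fill(email);
  await page.getByLabel('Password').fill(password);
  // Tier 1: bind to the role and accessible name of the submit control.
  await page.getByRole('button', { name: 'Sign in' }).click();
}
```

Nothing in this helper mentions element types, class names, or nesting, so swapping a native input for a custom one leaves it untouched as long as the labels stay associated.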

Tier 3: Explicit test attributes

data-testid, or a team-defined equivalent such as data-cy or data-test, defines a clear contract between the application and the test suite. The attribute exists solely for testing and remains stable through visual and structural refactors. It works best when text or role-based selectors are likely to change.

Playwright has first-class support for this via page.getByTestId(). The attribute name is configurable in playwright.config.ts via use.testIdAttribute, so you can standardize on whatever convention your codebase already uses.
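As a configuration sketch, assuming a codebase that already uses `data-test` rather than the default `data-testid`:

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Make page.getByTestId() read `data-test` instead of the default `data-testid`.
    testIdAttribute: 'data-test',
  },
});
```

With this in place, `page.getByTestId('checkout-submit')` resolves elements carrying `data-test="checkout-submit"`.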

Used selectively on meaningful elements, data-testid is a net positive. It introduces a stable contract between the application and the test suite without coupling tests to implementation details.

However, it comes with trade-offs. First, it introduces test-specific markup into production HTML. This is addressable: tools like babel-plugin-react-remove-properties can strip data-testid attributes at build time so the markup never ships to production, as long as the suite runs against a build that still includes them. Second, without a consistent naming policy (enforced through a custom ESLint rule or a CI check that validates attribute presence on critical components), usage drifts and becomes inconsistent across the codebase. Applying it selectively to meaningful UI elements such as interactive components, key containers, and state indicators keeps it effective. Annotating every DOM node creates false confidence rather than stability.

Tier 4: Text content (with caveats)

page.getByText() sits at the bottom of the hierarchy and is best reserved for content that has no semantic role, label, or test attribute. Passing a regex like /submit order/i handles case variations without requiring an exact string match.

getByText() is sensitive to content updates, so i18n migrations, copy edits, and CMS-driven changes can all break the locator even when the application behaves the same. When failures show up repeatedly in this tier, it points to a missing data-testid rather than a problem with the feature itself.

What never to use

CSS classes generated by component libraries such as .MuiButton-containedPrimary or .css-1x2y3z reflect build output, not user-facing behavior. They change frequently during design system updates, which makes them unreliable selectors.

Deeply chained CSS selectors are equally fragile. A small layout change or an extra wrapper can break the entire path without affecting how the feature works. Component libraries introduce these changes without warning.

XPath with positional predicates, nth-child without a filtering strategy, and dynamically generated id attributes create the same kind of risk. They depend on structure and ordering that do not stay stable, leading to failures that reveal more about the selector than the application.

Scoping and chaining locators

A good selector strategy ensures the test targets the right element. A good scoping strategy keeps it there as the page changes. The two are related but distinct, and solving only one leaves a common failure unaddressed.

Problems start when a page contains multiple elements that match the same locator. A page with two "Confirm" buttons, one in a form and one in a modal, becomes ambiguous if the test does not narrow the scope. Playwright locators are strict by default, meaning an action on a locator that resolves to more than one element throws an error and fails the test immediately. This is not just a best-practice concern; it is a framework constraint.

A common workaround is using an index such as .nth(0) or .nth(1). That relies on position and breaks as soon as the layout changes. The more reliable path is semantic scoping.

Container scoping

page.getByRole('dialog').getByRole('button', { name: 'Confirm' }) anchors the locator to a semantic container, keeping the test focused on what the user sees rather than how the DOM is arranged.

Locators should reflect how a user navigates the interface, not trace a CSS path through the DOM. "In the confirmation dialog, click the Confirm button" remains stable through layout refactors, while "the third button in the second div inside the modal wrapper" depends on structure that shifts with any layout change. Moving the dialog to a different position in the DOM breaks the second selector, while the first stays intact.
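A small helper can capture the "in this dialog, the Confirm button" phrasing directly. `RoleScope` below is a hypothetical interface for anything that exposes `getByRole` (in Playwright, both a page and a locator do), and the dialog name is an assumed example.

```typescript
// Hypothetical shape shared by anything that can be searched by role.
interface RoleScope {
  getByRole(role: string, options?: { name?: string | RegExp }): RoleScope;
}

// Scope to a named dialog first, then find the button inside it.
function confirmButtonIn(scope: RoleScope, dialogName: string): RoleScope {
  return scope
    .getByRole('dialog', { name: dialogName })
    .getByRole('button', { name: 'Confirm' });
}
```

A call such as `confirmButtonIn(page, 'Delete project')` stays unambiguous even when another "Confirm" button exists elsewhere on the page, and it does not care where the dialog sits in the DOM.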

Filtering locators

locator.filter({ hasText: /.../ }) and locator.filter({ has: page.getByRole(...) }) solve the problem of multiple matching elements by narrowing the result set based on content or children rather than position.

The table row pattern illustrates this: page.getByRole('row').filter({ has: page.getByText('Order #12345') }) targets the correct row by its content, not its position, so it stays accurate when data is reordered or generated dynamically. Index-based row selection breaks in both cases. In authentication flows that render user-specific data in tables such as session lists and device management panels, this pattern separates stable tests from ones that fail whenever the data order changes.
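The row pattern can be wrapped in a helper so tests never reach for an index. The interfaces here are hypothetical minimal shapes of the Playwright methods involved, and the "Order #" text format is taken from the example above.

```typescript
// Hypothetical minimal shapes of the Playwright objects used below.
interface LocatorLike {
  filter(options: { has?: LocatorLike; hasText?: string | RegExp }): LocatorLike;
}
interface PageLike {
  getByRole(role: string): LocatorLike;
  getByText(text: string | RegExp): LocatorLike;
}

// Select the row by what it contains rather than where it sits in the table.
function rowForOrder(page: PageLike, orderNumber: string): LocatorLike {
  return page
    .getByRole('row')
    .filter({ has: page.getByText(`Order #${orderNumber}`) });
}
```

Sorting the table or inserting new rows changes nothing for this helper, because the row's content is the only thing it depends on.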

Avoiding locator chains that re-encode structure

Chaining selectors through the DOM hierarchy does not add stability. It moves the structural coupling further down the chain where it is harder to spot.

page.locator('.container').locator('.form').locator('.input') is still a structural selector, just spread across three calls. It adds no stability over a single selector and only hides the coupling.

Each step in a chain should narrow the target semantically, such as scoping from a named dialog to a labeled button. Moving from a wrapper class to a child class to a grandchild class only mirrors the layout, which changes frequently. The first approach holds up when the UI changes; the second breaks along with the layout it mirrors.

Page object design for refactor resilience

Centralizing selectors and interactions into a dedicated abstraction layer (whether page objects, app actions, or domain helpers) keeps changes contained and prevents failures from spreading across the test suite.

Selectors alone do not prevent test breakage. You can have role-based locators, semantic scoping, and data-testid attributes, yet still see failures across multiple files when a single component changes. The root of the problem lies in how tests are organized.

Page objects are one approach to this. Playwright's fixture system supports another, often called app actions: rather than wrapping a page in a class, you expose high-level helper functions through fixtures created with test.extend. App actions work well for cross-cutting flows that don't map cleanly to a single page, while page objects remain a natural fit when you're modeling a specific UI surface with its own set of interactions. In both cases, the principle is the same: keep interaction logic out of individual tests.

When selectors are scattered throughout individual tests, even a small UI change can trigger updates across several files. Centralizing them in a dedicated layer keeps those changes contained.

Centralizing selectors in page objects

Page objects group selectors in one place and make them reusable across tests. The idea is simple, but it often breaks down in real codebases. A common failure pattern is duplication, where the same getByRole('button', { name: 'Sign in' }) call appears across many test files. When a product team changes the action from "Sign in" to "Log in," the functionality remains the same, but multiple tests fail at once. Fixing it means updating every occurrence, and every file touched is a merge conflict waiting to happen.

A page object should reflect what a user can do and observe, while keeping implementation details out of the tests. Tests interact with methods, not raw locators. The selector lives once, in one place, and that is the only file that needs to change when the UI does.

Page objects that expose behavior, not selectors

A page object property that returns a locator, such as get addToCartButton(), is only a small step up from placing the selector directly in the test. It centralizes the locator, but the test still depends on how the UI is built.

A method like addItemToCart(sku: string) reflects what the user is trying to do. Locators and interaction details stay inside the page object, so when the UI changes, updates happen in one place. For example, if the "add to cart" flow introduces a confirmation modal after a design system update, that change is handled inside the page object. The tests continue to call the same method without needing updates.
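One way such a method might look is sketched below. The class name `ProductListPage`, the `product-card-<sku>` test id convention, and the button name are all assumptions for illustration, and `PageLike` / `LocatorLike` are hypothetical minimal interfaces standing in for Playwright's `Page` and `Locator`.

```typescript
// Hypothetical minimal shapes of the Playwright objects used below.
interface LocatorLike {
  getByRole(role: string, options?: { name?: string | RegExp }): LocatorLike;
  click(): Promise<void>;
}
interface PageLike {
  getByTestId(testId: string): LocatorLike;
}

class ProductListPage {
  private readonly page: PageLike;

  constructor(page: PageLike) {
    this.page = page;
  }

  // Intent-level method: tests say what the user does, not how the DOM is built.
  async addItemToCart(sku: string): Promise<void> {
    const card = this.page.getByTestId(`product-card-${sku}`);
    await card.getByRole('button', { name: 'Add to cart' }).click();
    // A confirmation modal added by a later redesign would be handled here,
    // inside the page object, without touching any test.
  }
}
```

A test then reads `await new ProductListPage(page).addItemToCart('SKU-42')` and contains no locators of its own.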

Component-level page objects for design system migrations

Page-level page objects work well for stable applications, but design system updates demand more precision. Building reusable page objects around user behavior instead of UI structure keeps test suites stable as the application changes.

A DatePickerComponent page object handles every interaction with the date picker, no matter which page it appears on. When the design system upgrades the component, only DatePickerComponent needs updating. This moves the maintenance focus from the test suite to the page object library, maintained by the team closest to the changes.

We've seen this compound over time: each component-level page object you build is a file you do not have to touch during the next upgrade.

Coordinating with frontend teams

Test resilience is not owned by a single team. When a component PR changes internal structure without checking how tests depend on it, failures follow. Poor communication between developers and testers remains one of the most common sources of quality issues in engineering teams. Addressing it requires changes in how teams work together, not just more discussion.

Data-testid as a shared contract

Treating data-testid as part of a component's public API changes how you work with it. It moves from an informal convention to something documented and expected. When a component ships with data-testid="checkout-submit", that attribute becomes part of its interface and should be reviewed in every PR. Renaming or removing it without considering the test suite introduces avoidable breakage.

Consistency comes from enforcement. A custom ESLint rule or a CI check that validates attribute presence on critical components turns that expectation into something you can rely on. Without that discipline, test maintenance starts to dominate. Engineering time that should go toward new coverage goes to fixing broken selectors instead.
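As a rough sketch of what a CI check could look like, the function below takes a map of required test ids per component file and reports the files whose source no longer contains the expected attribute. The file names and ids are hypothetical, and the plain string search is a deliberate simplification; a production check would more likely inspect the AST or the rendered output.

```typescript
// Returns the files whose source is missing their required data-testid.
function findMissingTestIds(
  required: Record<string, string>, // file name -> expected data-testid value
  sources: Record<string, string>,  // file name -> file contents
): string[] {
  return Object.keys(required).filter((file) => {
    const source = sources[file] || '';
    return source.indexOf(`data-testid="${required[file]}"`) === -1;
  });
}

// Hypothetical policy for two critical components.
const requiredTestIds: Record<string, string> = {
  'CheckoutButton.tsx': 'checkout-submit',
  'LoginForm.tsx': 'login-form',
};
```

A CI step could read the listed files, call `findMissingTestIds(requiredTestIds, sources)`, and fail the build when the result is non-empty.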

Pre-refactor test audits

Before a large UI refactor begins, audit which tests depend on the components being changed. With selectors centralized in page objects, you can quickly trace where each component is used across the test suite. If forty tests depend on a date picker's internal structure, that number should inform the refactor plan before any changes begin.

If you skip the pre-refactor audit, you tend to discover the impact through failing CI runs. If you run the audit first, you can prepare by introducing data-testid attributes on critical components, sequencing changes to limit disruption, and maintaining coverage throughout the transition. The audit replaces reactive debugging with planned, controlled changes.

Playwright component testing as a complement

@playwright/experimental-ct-react mounts individual React components in a real browser using the same Playwright API as the integration suite. A DatePicker component can be tested in isolation before it appears in any user flow. When a design system upgrade introduces a new rendering pattern, the component suite catches regressions early, before they reach integration tests.

The experimental label matters. The API continues to change, setup requires extra Vite configuration, and there are real limitations around passing complex Node.js objects to the browser context. It works best alongside integration tests. Component tests focus on element behavior in isolation; integration tests cover full user journeys that cross context, routing, and API boundaries. When managing a design system migration, you get faster feedback at the component level, which reduces the cost of each upgrade.

Validating resilience: how to know your tests are actually decoupled

After a UI refactor, a green CI run can be misleading because tests may pass without actually monitoring the changes. Verifying that the suite truly responds to updates requires deliberate checks, not assumptions.

Selector resilience audit

Before running anything, a codebase-wide search for page.locator('.' in your test files will surface CSS class selectors immediately. These are the most likely coupling candidates. Static analysis catches them without any test execution.
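The search can also be scripted so it runs in CI. The sketch below scans a test file's source for `locator()` calls whose selector starts with a CSS class; the function and type names are hypothetical, and like the manual search it will miss class selectors that appear later in a compound selector.

```typescript
interface Finding {
  line: number; // 1-based line number in the scanned source
  text: string; // the offending line, trimmed
}

// Matches .locator('.foo'), .locator(".foo") and .locator(`.foo`).
const classSelectorCall = /\.locator\(\s*['"`]\./;

function findClassSelectors(source: string): Finding[] {
  return source
    .split('\n')
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter((entry) => classSelectorCall.test(entry.text));
}
```

Feeding each spec file's contents through `findClassSelectors` gives a list of lines to review before the refactor starts.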

Then run the dynamic check: rename a CSS class, change a build hash, or swap an element type that tests rely on, then run the suite. If tests fail, hidden dependencies exist that code review alone did not expose.

The process takes a few hours, but finding the same coupling mid-migration can cost days of engineering time. A visual refactor that produces zero test failures is the target. If tests fail after a CSS-only change with no functional impact, the selector design needs work.

Tracking test failures by root cause

Coupled tests and flaky tests can look similar. Both fail without pointing to a broken feature, but the difference shows up in how those failures group. Flaky tests fail intermittently with no consistent trigger, while selector-coupled tests tend to fail in clusters after UI-related PRs such as design system changes, component library upgrades, or Tailwind migrations.

Selector coupling adds to the cost of unreliable tests by triggering failures at the wrong time. A failure after a purely visual refactor is a bug in the test suite, not a regression in the product. Treating it as noise allows coverage gaps to grow unnoticed.

Tracking whether failures cluster after UI changes or logic updates makes the pattern easier to spot. Visual refactors should produce zero test failures unless application behavior has changed.
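If failure records are already tagged with the kind of change that preceded them, spotting the cluster can be a few lines of code. The record shape, the `'ui'` / `'logic'` tags, and the 0.8 threshold below are all assumptions for illustration, not something a particular tool emits.

```typescript
type ChangeType = 'ui' | 'logic';

interface FailureRecord {
  test: string;           // test title
  changeType: ChangeType; // kind of PR that preceded the failure
}

// Returns tests whose failures mostly follow UI-only changes.
function suspectedCoupledTests(records: FailureRecord[], threshold = 0.8): string[] {
  const totals: Record<string, { ui: number; all: number }> = {};
  for (const record of records) {
    const entry = totals[record.test] || (totals[record.test] = { ui: 0, all: 0 });
    entry.all += 1;
    if (record.changeType === 'ui') entry.ui += 1;
  }
  return Object.keys(totals).filter(
    (test) => totals[test].ui / totals[test].all >= threshold,
  );
}
```

Tests returned by this function are candidates for a selector review rather than a product bug hunt.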

Code review criteria for selector quality

Most selector coupling enters a codebase through code review that never questions it. Reviewers check whether a selector works, but rarely whether it will stay stable.

A few focused checks make a difference. Does this locator depend on DOM hierarchy? Does it reference a CSS class that a library upgrade could change? Is it scoped to a semantic container? Does it rely on text that a copy update might break?

Selector review deserves the same attention as API design. A fragile locator added today creates maintenance work with every refactor. If you apply these checks consistently, you spend far less time dealing with selector failures.

Tests that break for the right reasons

The selector hierarchy is not a ranking of personal preference. It is a ranking of what stays stable. getByRole queries what the app promises to expose to users and assistive technology. That contract rarely changes for arbitrary reasons. A CSS class generated by a build tool changes whenever the build changes. A data-testid is an explicit commitment your team makes. Text selectors are whatever the copy team decided that day.

Resilience compounds. The first UI refactor after you invest in semantic selectors and stable page objects produces fewer failures than the previous one. Over time, the suite earns a reputation as a reliable signal, and that reputation is what lets test outcomes actually influence engineering decisions.

Tracking failure patterns across releases makes that progress measurable and visible beyond the test infrastructure team. Start with your worst coupling pattern, fix it in the page objects, enforce it in code review, and the next design system migration becomes a much quieter event.

