Andrew Goldis, Jun 27, 2024

Cypress vs Playwright - Flakiness Analysis

Comparing Cypress and Playwright Test Flakiness Using Currents Data.

"What testing framework is more flaky - Cypress or Playwright?"

Last month, we shared a poll that sparked an interesting conversation and raised valid questions from community members. The poll is over, and we can now share more results based on a simple analysis of close to 400 million test records from hundreds of real-world applications.

Poll Results

What testing framework is more flaky: Cypress or Playwright?

Now, let's see if the data agrees with the poll results.

Discussion

We agree that flakiness cannot be solely attributed to the testing framework. A test may be unstable due to various factors, including:

The testing framework
The CI environment and infrastructure
The application being tested

So, how can we claim that a particular testing framework is less flaky than another if there are so many factors in play? Well, if we have a sufficiently large data set spanning different environments and applications, the effects of individual setups tend to average out, and we can estimate the stability of the testing framework with a reasonable degree of statistical confidence.

To be fair, the more appropriate question we are trying to answer is:

"What flakiness did we record across different apps and environments using Cypress vs Playwright?"

Results

We have collected millions of test records from both Cypress and Playwright. This data comes from hundreds of real-world projects of various sizes and complexities, running on different CI providers and setups.

To calculate the metrics, we took an anonymized subset of recorded test results:

We only used projects with more than 500 overall test records;
We calculated the flakiness rate per project;
We calculated metrics: average flakiness rate, p99, p95, p90, p75 and p50;
We removed the outliers with unusually high flakiness rates using the IQR (interquartile range) method.
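The steps above can be sketched roughly as follows. This is only an illustration, not the actual Currents pipeline: the function names, the input shape (pairs of flaky and total record counts per project), the upper Tukey fence of Q3 + 1.5 × IQR, the linear-interpolation percentile, and the choice to drop outliers before computing the summary metrics are all assumptions made for the example.

```python
from statistics import mean

MIN_RECORDS = 500  # project size threshold mentioned in the article


def flakiness_rate(flaky_records, total_records):
    """Flakiness rate of one project, in percent."""
    return 100.0 * flaky_records / total_records


def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) of a sorted, non-empty list."""
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def drop_high_outliers(rates):
    """Drop rates above Q3 + 1.5 * IQR (assumed fence; the article only says 'IQR method')."""
    ordered = sorted(rates)
    q1, q3 = percentile(ordered, 25), percentile(ordered, 75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return [r for r in ordered if r <= upper_fence]


def summarize(projects):
    """projects: iterable of (flaky_records, total_records) pairs, one per project."""
    rates = [flakiness_rate(f, t) for f, t in projects if t > MIN_RECORDS]
    kept = drop_high_outliers(rates)
    summary = {"projects": len(kept), "average": mean(kept)}
    for p in (50, 75, 90, 95, 99):
        summary[f"p{p}"] = percentile(kept, p)
    return summary
```

Note that the average here is a mean of per-project rates, so a small project weighs as much as a large one; whether the published averages were weighted by record count is not stated.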

Here's what we've got:

Cypress

Average flakiness rate: 0.83%
P99: 4.2%
P95: 3.18%
P90: 2.31%
P75: 1.25%
P50: 0.42%
Number of records included: 318,299,516
Number of projects included: 609

Playwright

Average flakiness rate: 0.72%
P99: 2.56%
P95: 1.91%
P90: 1.68%
P75: 1.09%
P50: 0.60%
Number of records included: 60,176,970
Number of projects included: 285

Conclusion

On average, teams experience 15.60% less flakiness with Playwright (0.72%) than with Cypress (0.83%): the absolute difference is 0.11 percentage points and the relative difference is 15.60%, both in favor of Playwright;
If we keep the outliers, we get a more extreme difference: Playwright 0.88%, Cypress 1.84% (see below);
Cypress not only has a higher average flakiness rate but also a wider spread (as evidenced by P95 and P99) - i.e. Cypress projects experienced more variability in flakiness;
Cypress data had a more extended tail towards higher flakiness rates, indicating more extreme values than Playwright (even after removing the outliers).
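To make the distinction between absolute and relative difference concrete, here is a small sketch that recomputes both from the rounded averages reported above (0.83% and 0.72%). The function name and the output keys are made up for the illustration. A relative difference depends on which framework is used as the baseline, and the rounded inputs give roughly 13.25% (Cypress as baseline) or 15.28% (Playwright as baseline); the published 15.60% was presumably computed from unrounded averages, which is an assumption on our side.

```python
def compare(cypress_avg, playwright_avg):
    """Absolute and relative gap between two average flakiness rates (in percent)."""
    diff = cypress_avg - playwright_avg
    return {
        # difference in percentage points
        "absolute_pp": diff,
        # "Playwright is X% less flaky than Cypress"
        "relative_to_cypress_pct": 100.0 * diff / cypress_avg,
        # "Cypress is X% more flaky than Playwright"
        "relative_to_playwright_pct": 100.0 * diff / playwright_avg,
    }


result = compare(0.83, 0.72)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```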

A few notes on the quality of data:

We included more Cypress projects (609 vs. 285) with a larger overall number of records, so the Cypress data represents a broader and more diverse set of projects;
A good portion of the Playwright tests were written by teams that switched from Cypress to Playwright and used the opportunity to refactor their test suites according to their needs, which can contribute to a lower flakiness rate for Playwright.

Summary

In general, teams are quite good at managing test flakiness, with many projects having no flakiness at all or below 1%.
Teams can eliminate flaky tests completely with both Cypress and Playwright.
On average, teams using Playwright have a slightly lower chance of experiencing flakiness.

Misc

What is a flaky test?

Cypress - a test record that didn’t pass on the first attempt;
Playwright - a test record that produced a non-expected outcome on the first attempt, with more than 1 attempt recorded, excluding skipped tests;
We only measured flakiness from within a single execution, based on multiple attempts - i.e. we didn't count restarting the CI pipeline and getting "passed" results for a previously failed test.
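The definitions above can be expressed as a small classifier. This is a sketch under stated assumptions: the `AttemptRecord` shape, the status strings, and the `expected` field are hypothetical and do not reflect the actual Currents, Cypress, or Playwright data models. The Cypress branch follows the wording above literally (anything other than a pass on the first attempt counts).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttemptRecord:
    """Hypothetical record: one test within a single execution."""
    framework: str           # "cypress" or "playwright"
    attempts: List[str]      # status of each attempt, in order, e.g. ["failed", "passed"]
    expected: str = "passed" # expected outcome (only used for Playwright)


def is_flaky(record):
    """Apply the per-framework flaky definitions to one record."""
    if not record.attempts:
        return False
    first = record.attempts[0]
    if record.framework == "cypress":
        # Did not pass on the first attempt.
        return first != "passed"
    if record.framework == "playwright":
        # Skipped tests are excluded.
        if first == "skipped" or record.expected == "skipped":
            return False
        # Non-expected outcome on the first attempt, with more than one attempt recorded.
        return len(record.attempts) > 1 and first != record.expected
    raise ValueError(f"unknown framework: {record.framework}")


def flakiness_rate(records):
    """Share of flaky records, in percent."""
    return 100.0 * sum(is_flaky(r) for r in records) / len(records)
```

Only attempts within a single execution are modeled here, which matches the note above: a test that passes only after the whole CI pipeline is restarted would not be counted.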

But Cypress Architecture assumes less flakiness

Yes, in theory. In practice, Cypress relies heavily on its Node.js layer for networking (interception and mocking), and it uses a mix of CDP (Chrome DevTools Protocol) and in-browser code.

Results with outliers included

Flakiness results with IQR outliers included

At Currents, we help measure, monitor, and resolve test flakiness. Give us a try!


Trademarks and logos mentioned in this text belong to their respective owners.