Performance endpoints:

| Probe | Success criterion | Actual |
| --- | --- | --- |
| COMPOSITE_TIME | Median per-user fraction of slow frames < 0.5% (absolute) | 0.15% slow |
| CONTENT_FRAME_TIME_VSYNC | ≤ 5% regression in median of per-user fraction of slow events | No difference |
| CONTENT_FULL_PAINT_TIME | ≤ 5% regression in median fraction of slow paints (> 16 ms) | 50.2-52.0% improvement |
| CONTENT_FULL_PAINT_TIME | ≤ 5% regression in median of per-user means | 8.9-9.9% regression |
| FX_PAGE_LOAD_MS_2 | ≤ 5% regression in median of per-user means | No difference |
| FX_TAB_SWITCH_COMPOSITE_E10S_MS | ≤ 5% regression in median of per-user means | 2.1-2.7% improvement |
| CHECKERBOARD_SEVERITY | ≤ 5% regression in rate of severe checkerboarding events per usage hour | 3.2% regression |

Stability endpoints:

| Endpoint | Success criterion | Actual |
| --- | --- | --- |
| Overall crash reports | ≤ 5% increase in crash rate | 2.8% increase in crash rate |
| OOM crash reports | ≤ 5% increase in crash rate | 12% decrease in OOM crashes |
| CANVAS_WEBGL_SUCCESS | ≤ 5% regression in median of fraction “True” per user | No difference |
| DEVICE_RESET_REASON | ≤ 5% increase in reset rate | 57% decrease in device resets |

The higher crash rate in the WebRender branch is attributable to an increase in the rate of GPU process crashes. Main and content process crash rates fell.

Retention and engagement metrics were not affected.

1 Introduction

WebRender is a new technology for getting webpages onto the screen using a GPU. In this experiment, we enabled WebRender for users in the Firefox 67 release channel running Windows 10 with certain Nvidia GPU chipsets.

This experiment followed a [very similar experiment](https://mozilla.report/post/projects/webrender-release-66/index.html) in release 66, and served as a monitoring canary for a simultaneous staged rollout that delivered WebRender to all Windows 10 desktop users with an allowlisted GPU model during the 67 release cycle.

We have been running a separate ongoing experiment in the beta and nightly channels to guide development, observing how performance changes on a build-by-build basis. This report does not describe that work.

2 Results

2.1 Performance

Before computing results for performance endpoints, user sessions were filtered to ensure that the compositor for the telemetry session matched the enrolled branch. The first telemetry session after a user enrolled was dropped for users in both branches, because the browser must be restarted after enrollment for WebRender to be enabled. (The enrollment session was identified as the session with the lowest profile_subsession_counter for each client_id.) Users who unenrolled from the experiment were excluded after unenrollment.

This avoids a bias toward the null that could result from contaminating the results for the treatment branch with results from users who were not exposed to the treatment. The approach may overestimate the actual effect of WebRender on the population if a non-random set of users (e.g. users with poor performance) were more likely to unenroll from the experiment, but this is unlikely because unenrollments were rare and balanced between the experiment and control branches (see “Enrollment” below).
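
As an illustration, here is a minimal pandas sketch of the session filter described above. The column names (client_id, branch, compositor, profile_subsession_counter) and the compositor strings are assumptions for the example, not the actual ETL schema.

```python
import pandas as pd

def filter_sessions(df: pd.DataFrame) -> pd.DataFrame:
    # Drop each client's enrollment session: the session with the
    # lowest profile_subsession_counter observed for that client_id.
    first = df.groupby("client_id")["profile_subsession_counter"].transform("min")
    df = df[df["profile_subsession_counter"] > first]

    # Keep only sessions whose active compositor matches the enrolled
    # branch. The compositor labels here are illustrative.
    expected = df["branch"].map({"WebRender": "webrender", "Gecko": "d3d11"})
    return df[df["compositor"] == expected]
```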

2.1.1 Continuous endpoints

| Metric | Median per-user mean, as WR % of Gecko | 95% CI (low) | 95% CI (high) |
| --- | --- | --- | --- |
| content_full_paint_time | 108.60 | 108.06 | 109.14 |
| page_load_ms | 100.08 | 99.48 | 100.66 |
| tab_switch_composite | 97.60 | 97.32 | 97.90 |

Median per-user mean values of content_full_paint_time were about 8.6% higher in the WebRender branch. Page load times did not change. Tab switch times decreased by a little more than 2%.
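
The “WR % of Gecko” figures compare the branch medians of per-user means. A minimal sketch of how such a ratio and its bootstrap confidence interval can be computed, assuming arrays of per-user means per branch (names are mine, not the report's ETL):

```python
import numpy as np

def median_ratio_ci(wr_means, gecko_means, n_boot=10_000, seed=0):
    """Bootstrap 95% CI for the WR/Gecko ratio of branch medians
    of per-user means, expressed as a percentage of Gecko."""
    rng = np.random.default_rng(seed)
    wr, gecko = np.asarray(wr_means), np.asarray(gecko_means)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        # Resample users with replacement within each branch.
        wr_med = np.median(rng.choice(wr, size=wr.size))
        gecko_med = np.median(rng.choice(gecko, size=gecko.size))
        ratios[i] = 100.0 * wr_med / gecko_med
    return np.percentile(ratios, [2.5, 97.5])
```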

The distribution of each metric is discussed in the subsections that follow, a pattern that continues throughout this report.

2.1.1.1 Content paint time

WebRender users tended to have a somewhat higher average CONTENT_FULL_PAINT_TIME, though they were less likely to have slow (> 16 ms) events (discussed below).

2.1.1.2 Page load time

The per-user-mean page load time distributions were essentially identical between WebRender and Gecko users.

2.1.1.3 Tab switch time

The median per-user average tab switch was slightly faster with WebRender. The fastest tab switches took longer with WebRender enabled, but the slowest tab switches took less time.

2.1.2 Thresholded absolute endpoints (composite time)

The criterion for COMPOSITE_TIME was that the median per-user slow fraction should be < 0.5%.

The median fraction of slow composites is much higher in the WebRender branch than in the Gecko branch, but remains below the 0.5% threshold:

| Branch | Median per-user slow composites (%) | 95% CI (low) | 95% CI (high) |
| --- | --- | --- | --- |
| WebRender | 0.1493 | 0.1477 | 0.1511 |
| Gecko | 0.0146 | 0.0144 | 0.0148 |
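
For concreteness, a sketch of how a per-user slow fraction can be computed from a bucketed histogram. The histogram schema and the 16 ms cutoff (one frame at 60 Hz) are my assumptions for the example; the probe's actual bucketing may differ.

```python
from statistics import median

def slow_fraction(histogram: dict[int, int], cutoff_ms: int = 16) -> float:
    """Fraction of one user's COMPOSITE_TIME samples at or above cutoff_ms.

    `histogram` maps bucket lower bounds (ms) to sample counts.
    """
    total = sum(histogram.values())
    slow = sum(n for lo, n in histogram.items() if lo >= cutoff_ms)
    return slow / total if total else float("nan")

# The endpoint is the median of this fraction across users, e.g.:
# median(slow_fraction(h) for h in per_user_histograms)
```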

2.1.3 Thresholded comparative endpoints

| Metric | Median per-user fraction, as WR % of Gecko | 95% CI (low) | 95% CI (high) |
| --- | --- | --- | --- |
| content_frame_time_vsync (> 200% vsync) | 100.21 | 99.21 | 101.06 |
| content_full_paint_time (> 16 ms) | 48.87 | 48.13 | 49.63 |

The median per-user fraction of slow content_frame_time_vsync events was similar between the branches.

The median per-user fraction of slow content_full_paint_time events was roughly halved in the WebRender branch.

2.1.3.1 Content frame time

The median WebRender and Gecko user experienced very similar fractions of slow CONTENT_FRAME_TIME_VSYNCs (> 200% vsync).

The fraction of slow frames was somewhat lower for the fastest half and higher for the slowest half of WebRender users compared to Gecko.

2.1.3.2 Content paint time

The median WebRender user experienced considerably fewer slow paints (> 16 ms) than the median Gecko user.

The worst-performing 20% of users in the WebRender and Gecko branches had similar slow paint fractions.

2.1.4 Checkerboarding

Checkerboarding refers to artefacts that appear during scrolling when paints for successive frames of the scroll are incomplete. The CHECKERBOARD_SEVERITY probe measures the area of the underpainted region multiplied by the duration of the event, in arbitrary units (au).

Based on the observed distribution of the metric, I took 500 au as an empirical threshold for “severe” checkerboarding events. Many users will eventually encounter a severe event, but such events are infrequent enough that estimating a per-user frequency with precision is difficult.

Instead, I present the rate per 1,000 usage hours over the population:

This shows a 3% excess of severe checkerboarding events in the WebRender branch. Error bars are 95% confidence intervals for Poisson-distributed event counts.
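
A sketch of the rate computation with an exact Poisson (Garwood) confidence interval; the event count and usage hours are whatever aggregates the figure is based on, not values reproduced here:

```python
from scipy import stats

def rate_per_khour(n_events: int, usage_hours: float, alpha: float = 0.05):
    """Severe-event rate per 1,000 usage hours with an exact Poisson CI."""
    lo = stats.chi2.ppf(alpha / 2, 2 * n_events) / 2 if n_events else 0.0
    hi = stats.chi2.ppf(1 - alpha / 2, 2 * (n_events + 1)) / 2
    scale = 1000.0 / usage_hours
    return n_events * scale, lo * scale, hi * scale
```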

2.2 Stability

Sessions were filtered for the stability analysis in the same manner as for performance.

Despite a clear increase in GPU process crashes, the overall crash rate was only slightly higher because the number of main and content process crashes fell.

2.2.1 Overall crash reports

2.2.2 Per-process crash reports

2.2.3 OOM crash reports

OOM crashes are a subset of main process crashes. They were less common in the WebRender branch.

2.2.4 WebGL canvas construction

Failure to create a WebGL canvas was rare in either branch. This is reflected in the per-user average fraction of canvas creation successes:

| branch | average_success_fraction |
| --- | --- |
| Gecko | 0.9995051 |
| WebRender | 0.9996397 |

2.2.5 Device resets

2.3 Engagement

Retention and engagement metrics were observed for all enrolled users from the moment of enrollment; filtering was not performed to ensure that the compositor matched the enrolled branch, and enrollment sessions were not discarded.

2.3.1 URI count

There was a small decline in the number of URIs visited by the least active users.

Bootstrapped 95% confidence intervals for the difference between the branch distributions along the curve show less usage in the WebRender branch at the 10th and 25th percentiles of the userbase:

A 10% decrease at the 10th percentile corresponds to about 1 fewer URI. A 2.5% decrease at the 25th percentile corresponds to a shift from 300 to 293 URIs.
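
This mirrors the bootstrap used for the continuous endpoints, but compares the branches at a fixed percentile of the per-user URI counts instead of at the median; again a sketch with hypothetical names:

```python
import numpy as np

def percentile_diff_ci(wr_uris, gecko_uris, q, n_boot=10_000, seed=0):
    """Bootstrap 95% CI for the percent difference between branches
    at the q-th percentile of per-user URI counts."""
    rng = np.random.default_rng(seed)
    wr, gecko = np.asarray(wr_uris), np.asarray(gecko_uris)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        w = np.percentile(rng.choice(wr, size=wr.size), q)
        g = np.percentile(rng.choice(gecko, size=gecko.size), q)
        diffs[i] = 100.0 * (w - g) / g
    return np.percentile(diffs, [2.5, 97.5])

# e.g. percentile_diff_ci(wr_uris, gecko_uris, q=10) for the 10th percentile
```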

2.3.2 Active time

The distribution of per-user active time also showed a slight decrease for less active users:

Among less avid users, active time may have decreased slightly in the WebRender branch.

2.3.3 Total time

The distribution of total browser-open time may also have shown a small decrease for less avid users.

Similar to active time, less-avid users may have used the browser slightly less in the WebRender branch.

2.4 Retention

Retention was similar between the study branches.

Retention may have been slightly lower for the WebRender branch at 3 weeks. The 95% confidence interval for the true difference between the branches spanned -0.03% to 0.56%.
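
The quoted interval is a confidence interval on a difference of two proportions. One standard way to compute such an interval is the normal (Wald) approximation sketched below; the retained/enrolled counts are hypothetical inputs, and this may not be the exact method used in the report:

```python
import math

def retention_diff_ci(k_gecko, n_gecko, k_wr, n_wr, z=1.96):
    """Wald 95% CI for the difference in 3-week retention proportions
    (Gecko minus WebRender), in percentage points."""
    p_g, p_w = k_gecko / n_gecko, k_wr / n_wr
    se = math.sqrt(p_g * (1 - p_g) / n_gecko + p_w * (1 - p_w) / n_wr)
    diff = p_g - p_w
    return 100 * (diff - z * se), 100 * (diff + z * se)
```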

2.5 Enrollment

Daily enrollment and unenrollment were symmetric between branches.

Enrollment counts were inflated because the recipe was not written to filter on wrQualified status.

Unenrollments were minimal and distributed equally between branches.

3 Conclusions

  • The WebRender experiment met all but one of the performance goals. Although the median per-user mean CONTENT_FULL_PAINT_TIME increased, the number of measurements greater than 16 ms (the frame budget at a 60 Hz refresh rate) actually decreased. Because most users have a 60 Hz refresh rate, this may not be a generally user-visible regression.
  • The WebRender experiment had generally salutary effects on stability, except for an increase in GPU process crashes. Main process and content process crashes, which are more visible to the user, decreased.
  • The WebRender experiment did not have clear impacts on user engagement or retention, although there may have been a small decrease in usage, as measured by active hours, URIs visited, and total session time among the least avid users in the experiment.

4 Methods

The pref-flip-webrender-perf67-1526094 experiment enrolled users in Firefox 67 who met the normandy.telemetry.main.environment.system.gfx.features.wrQualified.status == 'available' criterion. At the time of the study, this selected users running Windows 10 on systems without a battery and with one of a list of allowlisted graphics cards.

ETL was computed by two notebooks: