When Your Performance Dashboard Lies: RUM Pitfalls to Avoid
A practical deep dive into RUM sampling bias, percentile traps, and why your numbers may disagree with CrUX.
A familiar story: dashboards look green, release goes out, and users still complain the app feels slow. Most of the time, this is not because RUM is useless. It is because we are reading it incorrectly.
I have seen this happen after seemingly successful launches: p75 LCP looked stable on the main dashboard, but support tickets spiked from Android users on weaker networks. The issue was real; the dashboard just did not represent that segment well.
RUM is only as good as the population you capture and the way you aggregate it. Small mistakes in sampling and percentiles can make a healthy graph hide a bad user experience.
Pitfall 1: Sampling bias hides the users who struggle most
Many teams unintentionally sample the easiest sessions: fast devices, stable networks, and logged-in users who stay longer. The slow edge cases are often underrepresented or completely missing.
A classic example is consent-gated analytics. Users who accept quickly are often on faster devices and stronger connections. Users who bounce early, struggle with heavy first-load JS, or never complete consent are the ones most likely to have poor performance, and they disappear from the dataset.
- Script blocked by ad blockers, consent flows, or strict network policies.
- Users bounce before your RUM SDK initializes.
- Low-end phones drop events due to memory pressure.
- Sampling rates differ between pages or environments.
- Only authenticated traffic is measured, while anonymous traffic is ignored.
If your worst users are the least likely to be measured, your p75 will look artificially good.
On one storefront migration, we tracked LCP only after user authentication because that was where product analytics already existed. Homepage and category pages looked fine in RUM, but revenue dropped. Later we discovered anonymous traffic on landing pages had much worse load performance and was never being measured.
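To get a feel for how much unmeasured slow sessions can move p75, here is a minimal sketch with synthetic numbers. The session shape (`lcp`, `measured`) and all values are assumptions made up for illustration, not data from a real product.

```javascript
// Nearest-rank 75th percentile.
function p75(samples) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.ceil(0.75 * sorted.length) - 1];
}

// Synthetic sessions (illustrative numbers, not real data).
// `measured` marks whether the RUM SDK managed to report the session.
const sessions = [
  ...Array.from({ length: 6 }, () => ({ lcp: 1500, measured: true })),
  ...Array.from({ length: 2 }, () => ({ lcp: 2400, measured: true })),
  { lcp: 5000, measured: true },
  // Slow sessions that bounced or were blocked before reporting.
  ...Array.from({ length: 3 }, () => ({ lcp: 5000, measured: false })),
];

const truePopulationP75 = p75(sessions.map((s) => s.lcp));
const measuredSessions = sessions.filter((s) => s.measured);
const dashboardP75 = p75(measuredSessions.map((s) => s.lcp));
const coverage = measuredSessions.length / sessions.length;

console.log({ truePopulationP75, dashboardP75, coverage });
// { truePopulationP75: 5000, dashboardP75: 2400, coverage: 0.75 }
```

With only a quarter of the sessions missing, the dashboard p75 in this toy dataset is less than half of the true value, because the missing sessions all sit in the slow tail.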
Pitfall 2: Percentile traps
Median (p50) is usually too optimistic for UX decisions. Core Web Vitals are evaluated at p75 for a reason: the slow quarter of your users matters.
Another common issue is averaging page-level percentiles. Percentiles are not additive. Averaging p75 values across routes can create numbers that have no statistical meaning.
// ❌ Wrong: averaging p75 from each page
const averagedLcp = average([
  p75(homeLcpSamples),
  p75(searchLcpSamples),
  p75(productLcpSamples),
]);

// ✅ Better: combine all page samples first, then compute p75
const dashboardLcp = p75([
  ...homeLcpSamples,
  ...searchLcpSamples,
  ...productLcpSamples,
]);

If one route has 10x traffic, it should contribute 10x weight. Sample-level aggregation handles that naturally. Route-level averaging does not.
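The `p75` and `average` helpers in that comparison are not defined in the article. A minimal runnable sketch, assuming a nearest-rank percentile and synthetic LCP samples (both are my assumptions, chosen only to make the gap visible), could look like this:

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p / 100) * sorted.length), 1);
  return sorted[rank - 1];
}

const p75 = (samples) => percentile(samples, 75);
const average = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Synthetic LCP samples in ms (illustrative only).
const homeLcpSamples = Array(8).fill(1200); // high-traffic, fast route
const productLcpSamples = [4000, 4200];     // low-traffic, slow route

// Route-level averaging gives each route equal weight regardless of traffic.
const averagedP75 = average([p75(homeLcpSamples), p75(productLcpSamples)]);

// Sample-level pooling weights each route by how many users actually hit it.
const pooledP75 = p75([...homeLcpSamples, ...productLcpSamples]);

console.log({ averagedP75, pooledP75 });
// { averagedP75: 2700, pooledP75: 1200 }
```

Neither number is "the" answer for every question, but only the pooled value describes what the 75th-percentile user across both routes experienced.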
Another trap is mixing very different journeys into one line. Imagine home page, search results, and checkout all bundled into one p75. A checkout regression can be severe but invisible because homepage traffic dominates volume.
Example:
- Home: p75 INP = 140ms, 900k samples
- Checkout: p75 INP = 420ms, 40k samples
Global blended line may still look "acceptable",
while the revenue-critical path is clearly degraded.
Pitfall 3: CrUX and internal RUM disagree (and both can still be correct)
Teams panic when CrUX p75 LCP is worse than internal RUM. Usually this is a scope mismatch, not a data bug.
- Population: CrUX covers only Chrome users who meet its eligibility criteria (for example, opted in to usage statistics reporting); your RUM depends on your SDK coverage.
- Scope: CrUX can be origin-level or URL-level; your dashboard may be route templates.
- Window: CrUX uses a rolling 28-day window; your chart might be last 24h or 7d.
- Filtering: country/device/network filters may differ by default.
- Instrumentation: custom LCP handling may not match standard definitions.
If you compare non-equivalent populations, every conclusion after that is shaky.
In practice, I treat CrUX as an external truth signal and internal RUM as operational detail. If CrUX worsens but internal RUM looks flat, that is usually a hint that your internal coverage is missing a segment, not that CrUX is wrong.
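The window mismatch is the easiest of these to reproduce. The sketch below recomputes internal p75 over a trailing 28-day window so it can sit next to a CrUX number. It only aligns the time window; it does not replicate how CrUX aggregates data. The sample shape (`timestampMs`, `value`) and the numbers are assumptions for illustration.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Nearest-rank 75th percentile.
function p75(samples) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.ceil(0.75 * sorted.length) - 1];
}

// p75 over samples whose timestamp falls in (endTimeMs - windowDays, endTimeMs].
function trailingWindowP75(samples, endTimeMs, windowDays = 28) {
  const startTimeMs = endTimeMs - windowDays * DAY_MS;
  const inWindow = samples
    .filter((s) => s.timestampMs > startTimeMs && s.timestampMs <= endTimeMs)
    .map((s) => s.value);
  return p75(inWindow);
}

// Synthetic data: one LCP sample per day, with a regression in the last 7 days.
const endTimeMs = Date.UTC(2024, 0, 29);
const samples = Array.from({ length: 28 }, (_, daysAgo) => ({
  timestampMs: endTimeMs - daysAgo * DAY_MS,
  value: daysAgo < 7 ? 3500 : 2000,
}));

const last7d = trailingWindowP75(samples, endTimeMs, 7);
const last28d = trailingWindowP75(samples, endTimeMs, 28);

console.log({ last7d, last28d });
// { last7d: 3500, last28d: 2000 }
```

The same underlying data gives two very different p75 values, so a 7-day internal chart and a 28-day external one can disagree without either being wrong.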
A practical reconciliation framework
Before comparing numbers, force both views into the same shape. In my experience, this removes most of the confusion in performance reviews.
- Compare p75 to p75 only.
- Use similar time windows (or annotate when they differ).
- Segment by device class, country, and connection type.
- Separate high-traffic routes from long-tail routes.
- Track both inclusion rate and metric values.
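The segmentation step can be sketched as a small grouping function. The event shape (`routeGroup`, `value`) mirrors the kind of context fields discussed in this article, but the function name and the INP values are assumptions, picked to echo the home vs checkout example from Pitfall 2.

```javascript
// Nearest-rank 75th percentile.
function p75(samples) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  return sorted[Math.ceil(0.75 * sorted.length) - 1];
}

// Groups events by a segment key and reports p75 plus sample count per segment.
function p75BySegment(events, keyFn) {
  const groups = new Map();
  for (const event of events) {
    const key = keyFn(event);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(event.value);
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = { p75: p75(values), samples: values.length };
  }
  return result;
}

// Synthetic INP events in ms (illustrative only): nine home samples, one checkout sample.
const homeValues = [90, 100, 110, 120, 130, 140, 140, 150, 160];
const events = [
  ...homeValues.map((value) => ({ routeGroup: "home", value })),
  { routeGroup: "checkout", value: 420 },
];

const blended = p75(events.map((e) => e.value));
const byRoute = p75BySegment(events, (e) => e.routeGroup);

console.log(blended); // 150
console.log(byRoute);
// { home: { p75: 140, samples: 9 }, checkout: { p75: 420, samples: 1 } }
```

Reporting the sample count next to each p75 matters: a segment with very few samples should be read as a hint, not a measurement.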
Frontend RUM metrics that are worth tracking
- LCP (p75): loading speed of the main content for real users.
- INP (p75): responsiveness under real interaction pressure.
- CLS (p75): visual stability during the page lifecycle.
- TTFB (p75): backend + network delay as seen by the browser.
- Navigation type split: fresh load vs bfcache restore vs back/forward.
- Long tasks count and total blocking time per session.
- Resource timing outliers (largest JS/CSS/image contributors).
- Coverage metric: % sessions where vitals were successfully captured.
Alongside vitals, attach context fields at capture time: route group, device class, connection type, country, and release version. Without these dimensions, debugging is mostly guesswork.
import { onLCP, onINP, onCLS, onTTFB } from "web-vitals";

// getRouteGroup() is assumed to be an app-specific helper that maps a pathname to a route group.
function sendRum(metric) {
  const payload = {
    metricName: metric.name,
    value: metric.value,
    id: metric.id, // unique per metric instance, so the backend can keep only the latest report
    path: window.location.pathname,
    routeGroup: getRouteGroup(window.location.pathname),
    releaseVersion: window.__APP_VERSION__,
    deviceMemory: navigator.deviceMemory ?? null, // not available in every browser
    connectionType: navigator.connection?.effectiveType ?? "unknown", // not available in every browser
    // web-vitals reports "back-forward-cache" for bfcache restores;
    // the Navigation Timing entry alone does not distinguish them.
    navigationType: metric.navigationType ?? "unknown",
  };
  navigator.sendBeacon("/rum", JSON.stringify(payload));
}

onLCP(sendRum);
onINP(sendRum);
onCLS(sendRum);
onTTFB(sendRum);

Slicing by these fields quickly exposes where the pain actually is. In many products, desktop p75 is fine while low-end Android in specific regions is severely degraded.
One useful ritual is to review vitals by business-critical slice every week: first-time visitors, checkout funnel, and low-end mobile. This catches issues earlier than a single blended dashboard.
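The capture snippet above calls `getRouteGroup`, which this article does not define. A minimal sketch, assuming a small table of regular expressions, is below. The patterns and group names are hypothetical placeholders and would need to match your own routing.

```javascript
// Hypothetical route table: patterns and group names are placeholders.
const ROUTE_GROUPS = [
  { group: "home", pattern: /^\/$/ },
  { group: "search", pattern: /^\/search(\/|$)/ },
  { group: "product", pattern: /^\/products\/[^/]+\/?$/ },
  { group: "checkout", pattern: /^\/checkout(\/|$)/ },
];

// Maps a concrete pathname to a stable route group so that
// /products/123 and /products/456 aggregate into the same series.
function getRouteGroup(pathname) {
  const match = ROUTE_GROUPS.find(({ pattern }) => pattern.test(pathname));
  return match ? match.group : "other";
}

console.log(getRouteGroup("/products/123"));     // "product"
console.log(getRouteGroup("/checkout/payment")); // "checkout"
console.log(getRouteGroup("/about"));            // "other"
```

Grouping by template rather than raw path keeps cardinality low enough for percentiles to be meaningful, and the explicit "other" bucket makes unmapped routes visible instead of silently dropped.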
Instrumentation checklist that prevents misleading dashboards
- Record a lightweight event as early as possible to estimate RUM script coverage.
- Log sample inclusion/exclusion reason (blocked, consent denied, no vitals event, etc.).
- Store context: route, device class, effective connection type, country, and release version.
- Keep metric definitions stable; version them when you change calculation logic.
- Alert on both metric regression and sudden drop in sample volume.
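The last item, alerting on a drop in sample volume, can be sketched as a simple baseline comparison. The function name, the 30% default threshold, and the daily counts are all assumptions for illustration; a production alert would likely need to account for weekday seasonality.

```javascript
// Compares the latest day's sample count with the average of the preceding days.
// `dailyCounts` is ordered oldest to newest; the last entry is the day being checked.
function detectVolumeDrop(dailyCounts, dropThreshold = 0.3) {
  if (dailyCounts.length < 2) {
    throw new Error("Need at least one day of history plus the latest day");
  }
  const history = dailyCounts.slice(0, -1);
  const latest = dailyCounts[dailyCounts.length - 1];
  const baseline = history.reduce((sum, n) => sum + n, 0) / history.length;
  const dropRatio = baseline === 0 ? 0 : (baseline - latest) / baseline;
  return { baseline, latest, dropRatio, alert: dropRatio >= dropThreshold };
}

// Synthetic counts: stable volume, then a sudden 40% drop.
const dropCheck = detectVolumeDrop([1000, 1000, 1000, 600]);
console.log(dropCheck.alert); // true

// Stable volume does not alert.
const stableCheck = detectVolumeDrop([1000, 1000, 1000, 980]);
console.log(stableCheck.alert); // false
```

A volume drop with flat or improving vitals is worth treating as a measurement incident first, since the missing sessions may be exactly the slow ones.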
A dashboard should answer two questions at once: “How fast are we?” and “Who are we missing?” If you only answer the first, you can still ship regressions confidently.
If I had to pick one change with the highest ROI: add an explicit coverage panel next to vitals. Show how many sessions were eligible, measured, excluded, and why. Teams make better performance decisions when uncertainty is visible.
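The data behind such a coverage panel can be a very small aggregation. The sketch below assumes you can list eligible sessions from some independent source (server logs or an early lightweight event) and that each record carries a `measured` flag and an optional `exclusionReason`; those field names are hypothetical.

```javascript
// Summarizes how many eligible sessions produced vitals and why the rest did not.
function summarizeCoverage(sessions) {
  const summary = { eligible: sessions.length, measured: 0, excluded: 0, reasons: {} };
  for (const session of sessions) {
    if (session.measured) {
      summary.measured += 1;
      continue;
    }
    summary.excluded += 1;
    const reason = session.exclusionReason ?? "unknown";
    summary.reasons[reason] = (summary.reasons[reason] ?? 0) + 1;
  }
  summary.coverageRate = summary.eligible === 0 ? 0 : summary.measured / summary.eligible;
  return summary;
}

// Synthetic sessions (illustrative only).
const coverageSummary = summarizeCoverage([
  { measured: true },
  { measured: true },
  { measured: false, exclusionReason: "consent_denied" },
  { measured: false, exclusionReason: "blocked" },
  { measured: false },
]);

console.log(coverageSummary);
// {
//   eligible: 5, measured: 2, excluded: 3,
//   reasons: { consent_denied: 1, blocked: 1, unknown: 1 },
//   coverageRate: 0.4
// }
```

Keeping an explicit "unknown" bucket is deliberate: if it grows, the exclusion logging itself has a gap.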
How frontend teams can apply this in practice
For frontend engineers, this becomes most useful when it is part of the delivery workflow instead of a separate analytics exercise. A simple operating model works well in most teams:
- Instrument vitals early in the page lifecycle, especially on anonymous and landing routes.
- Track measurement coverage with explicit exclusion reasons (blocked script, consent, early bounce, runtime error).
- Use p75 by segment (device class, country, route group) instead of one blended global chart.
- Keep business-critical journeys isolated in dashboards (signup, checkout, payment success path).
- Attach release version to RUM events so regressions map to deploys quickly.
- Review CrUX and internal RUM together on a fixed cadence, with aligned time windows and filters.
In code reviews, it helps to include a short performance note: expected LCP/INP/CLS impact, likely affected segment, and rollout guardrails. This habit keeps performance ownership close to feature work.
Closing note
RUM is one of the best tools we have for real-user performance, but only if we treat it as measurement science, not just pretty charts. Good sampling and correct percentile math turn performance work from debate into engineering.