Measured Against Ground Truth — FactorPrism® Accuracy Benchmark

Why This Study Exists

Most Tools Can't Even See the Place Where the Cause Lives

Suppose margin fell because of something specific to Northeast Outerwear: one region's slice of one product line. In the tooling most teams actually use, that location is not a row on any screen. Slice by region and the problem smears across everything Northeast sells; slice by product and it smears across every region's outerwear. The cross-hierarchy intersection where the cause actually acts is structurally invisible to one-dimension-at-a-time analysis. Tree-based screening functions do touch such intersections, but they emit them as overlapping fragments with no location in your hierarchy and no reconciliation to the total change you are trying to explain.
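A minimal sketch of the smearing effect described above, using made-up numbers. It assumes a hypothetical 3 × 3 grid of region-by-product segments, each starting at 10,000 units, and a single cause that removes 2,000 units from Northeast × Outerwear only. The segment names and sizes are illustrative assumptions, not benchmark data.

```python
# Hypothetical grid: every (region, product) segment starts at 10,000 units.
regions = ["Northeast", "South", "West"]
products = ["Outerwear", "Footwear", "Accessories"]

before = {(r, p): 10_000.0 for r in regions for p in products}
after = dict(before)

# Assumed cause: acts only on the Northeast x Outerwear intersection.
after[("Northeast", "Outerwear")] -= 2_000.0


def pct_change(keys):
    """Percent change of the aggregate over the given segments."""
    b = sum(before[k] for k in keys)
    a = sum(after[k] for k in keys)
    return 100.0 * (a - b) / b


# One-dimension-at-a-time views.
by_region = {r: pct_change([(r, p) for p in products]) for r in regions}
by_product = {p: pct_change([(r, p) for r in regions]) for p in products}

# The intersection where the cause actually acts.
cell = pct_change([("Northeast", "Outerwear")])

print(f"Northeast (all products): {by_region['Northeast']:.2f}%")
print(f"Outerwear (all regions):  {by_product['Outerwear']:.2f}%")
print(f"Northeast x Outerwear:    {cell:.2f}%")
```

In this toy case each one-dimensional view shows a drop of about 6.7%, while the intersection itself fell 20%. Neither single-dimension slice names the place where the cause acts.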

So the first question about any root cause tool is whether the answer is visible at all. And the second question follows immediately: when a tool does name a cause, how would you know if it's right? On real business data there is no answer key. The industry's track record here is poor, and documented: anomaly-detection platforms became famous for alert fatigue — hundreds of flagged segments, most of them noise — and automated insight tools surface "interesting" segments with no statement of accuracy at all.

This study answers both questions the only honest way: generate business data where the true causes are injected by construction — at known locations across the hierarchy lattice, including exactly the cross-hierarchy intersections ordinary tooling can't render — then run the production FactorPrism® engine against it blind, and score whether it finds each cause and the place where it acts. We also do something vendors rarely do: we report where performance degrades, and we publish the comparison against the naive method most tools (and most analysts) actually use.

Benchmark Design

140 Datasets, Known Causes, Blind Engine

Structure: Two hierarchies (Product: category → subcategory; Geography: region → state) with four children per level, giving 16 × 16 = 256 leaf segments, each starting at ~10,000 units. Two periods, before vs after: the canonical "why did the number move?" question.

Injected causes (the ground truth): 3 causes per dataset, each a ~5% effect applied at a randomly chosen node of the hierarchy, sometimes a whole category, sometimes a single state. The benchmark scores whether the engine finds both the effect and the location where it acts, with the correct direction.

Noise sweep: Every leaf also moves randomly between periods, with the noise level swept from 0.1% (stable enterprise aggregates) to 15% (extremely volatile segments). 7 noise levels × 20 independently generated datasets per level = 140 datasets, each run through two configurations (280 runs, zero failures).

The comparison arm: The same engine with significance gating disabled, keeping only a magnitude threshold. This is the standard "report every segment that moved more than X% of the change" approach used by dashboard tooling and manual analysis alike.
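The dataset design above can be sketched roughly as follows. This is not the study's generator, only an illustration under stated assumptions: every leaf starts at exactly 10,000 units, each cause is a multiplicative ±5% effect, noise is independent Gaussian per leaf, causes are never injected at the grand total, and the function names (`generate`, `random_scope`, `in_scope`) are hypothetical.

```python
import random

N = 4  # children per level, as in the benchmark description


def random_scope(rng):
    """Pick a node in one two-level hierarchy.

    () means the whole hierarchy, (i,) a top-level node (category or region),
    (i, j) a second-level node (subcategory or state).
    """
    depth = rng.randint(0, 2)
    return tuple(rng.randrange(N) for _ in range(depth))


def in_scope(path, scope):
    """True if a leaf's path in one hierarchy falls under the given node."""
    return path[:len(scope)] == scope


def generate(noise_sd, n_causes=3, effect=0.05, base=10_000.0, seed=0):
    """Return (before, after, causes) for one synthetic two-period dataset."""
    rng = random.Random(seed)
    # Each leaf is (product path, geography path): 16 x 16 = 256 leaves.
    leaves = [((c, s), (r, t))
              for c in range(N) for s in range(N)
              for r in range(N) for t in range(N)]

    causes = {}  # (product scope, geography scope) -> direction (+1 or -1)
    while len(causes) < n_causes:
        location = (random_scope(rng), random_scope(rng))
        if location == ((), ()):
            continue  # assumption: never inject at the grand total
        causes.setdefault(location, rng.choice([-1, 1]))

    before = {leaf: base for leaf in leaves}
    after = {}
    for prod_path, geo_path in leaves:
        value = base
        for (prod, geo), sign in causes.items():
            if in_scope(prod_path, prod) and in_scope(geo_path, geo):
                value *= 1 + sign * effect
        # Assumption: independent Gaussian multiplicative noise per leaf.
        value *= 1 + rng.gauss(0.0, noise_sd)
        after[(prod_path, geo_path)] = value
    return before, after, causes


before, after, causes = generate(noise_sd=0.01, seed=42)
print(len(before), "leaves;", len(causes), "injected causes")
for (prod, geo), sign in causes.items():
    print("cause at product", prod, "geography", geo, "direction", sign)
```

Because a cause is stored as a pair of scopes, one per hierarchy, the same representation covers a whole category, a single state, and a cross-hierarchy intersection such as one category within one region.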

Results

The Numbers

A returned factor counts as a hit only if it names the injected location and the correct direction. Recall = share of injected causes recovered; recall@top-5 = share of injected causes recovered within the first five ranked factors, where a reader will actually look; precision = share of returned factors that correspond to a real cause; magnitude error = how far the estimated impact is from the injected one.

Period-over-period noise	Config	Factors returned (mean / max)	Recall	Recall@top-5	Precision	Median magnitude error
0.1%	FactorPrism®	4.4 / 12	100%	100%	94%	1.5%
0.1%	magnitude-only	6.8 / 52	100%	100%	94%	1.4%
0.5%	FactorPrism®	5.5 / 12	100%	100%	80%	2.0%
0.5%	magnitude-only	13.4 / 131	100%	100%	79%	2.0%
1%	FactorPrism®	7.6 / 12	100%	100%	62%	4.0%
1%	magnitude-only	27.9 / 175	100%	100%	48%	4.2%
2%	FactorPrism®	10.0 / 12	98%	95%	43%	8.2%
2%	magnitude-only	57.0 / 190	97%	95%	29%	8.2%
5%	FactorPrism®	12 / 12	95%	78%	35%	16.9%
5%	magnitude-only	106.2 / 226	97%	77%	21%	23.1%
10%	FactorPrism®	12 / 12	73%	58%	30%	36.1%
10%	magnitude-only	142.7 / 240	100%	57%	20%	49.7%
15%	FactorPrism®	12 / 12	63%	38%	25%	42.7%
15%	magnitude-only	159.1 / 249	100%	40%	19%	55.8%

20 datasets per noise level per configuration; engine at shipping defaults, no benchmark-specific tuning. Full method notes below.
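The scoring definitions given before the table could be computed along these lines. This is a sketch, not the study's scoring code: the `score` function and its input shapes are hypothetical, and magnitude error is assumed here to be relative (absolute difference divided by the true impact), which the text does not specify.

```python
from statistics import median


def score(injected, returned, k=5):
    """Score a ranked factor list against injected causes.

    injected: dict mapping location -> true signed impact.
    returned: list of (location, estimated impact), best-ranked first.
    """
    hits = {}  # location -> (rank, estimated impact) of its first match
    for rank, (location, estimate) in enumerate(returned, start=1):
        true_impact = injected.get(location)
        if true_impact is None or location in hits:
            continue
        if (estimate > 0) == (true_impact > 0):  # direction must match
            hits[location] = (rank, estimate)

    # Assumption: magnitude error is measured relative to the true impact.
    errors = [abs(est - injected[loc]) / abs(injected[loc])
              for loc, (_, est) in hits.items()]
    return {
        "recall": len(hits) / len(injected),
        "recall_at_k": sum(rank <= k for rank, _ in hits.values()) / len(injected),
        "precision": len(hits) / len(returned) if returned else 0.0,
        "median_magnitude_error": median(errors) if errors else None,
    }


# Hypothetical example: three injected causes, four returned factors.
injected = {
    ("Outerwear", "Northeast"): -500.0,
    ("Footwear", "ALL"): 300.0,
    ("ALL", "Texas"): -200.0,
}
returned = [
    (("Outerwear", "Northeast"), -490.0),  # hit
    (("ALL", "California"), 50.0),         # false positive
    (("Footwear", "ALL"), 310.0),          # hit
    (("ALL", "Texas"), 40.0),              # right place, wrong direction
]
result = score(injected, returned)
print(result)
```

In this made-up example two of three causes are recovered (recall 2/3), both within the top five, and two of four returned factors are real (precision 0.5). The Texas factor names the right location but the wrong direction, so it does not count as a hit.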

What The Numbers Say

Four Findings

1. In its operating regime, the engine is essentially exact.

At the noise levels typical of established business aggregates (≤0.5% random period-over-period movement per segment), FactorPrism® recovered 100% of injected causes in 40 of 40 datasets, ranked every one in the top 5, kept precision at 80–94%, and held median magnitude error to 2% or less, while returning 4.4–5.5 factors on average (never more than 12). The answer isn't merely directionally right; it names the correct locations and gets the sizes right.

2. The factor list never floods.

This is the category's documented failure mode, and the headline result. By 2% noise, magnitude-only thresholding returns 57 factors on average (and as many as 190): an alert flood in which the three real causes are buried among dozens of false leads. FactorPrism®'s significance gate never returns more than 12, with essentially the same recall (98% vs 97%). The gate isn't discarding truth to look tidy; it's discarding segments whose movement carries no evidence of signal beyond their own volatility.

3. When data gets very noisy, no method can save a small effect — and honesty matters more than recall.

At 10–15% noise, a 5% cause sits at or below what two periods of data can mathematically distinguish. The magnitude-only arm still shows "100% recall" at 15% noise, but only because it returns ~160 factors on average; the truth is in the list the way a name is in a phone book (precision 19%, recall@top-5 of 40% against FactorPrism®'s 38%). No tool should claim to work in this regime. What a tool should do there is hold the line at a bounded, ranked dozen candidates instead of handing a business user up to 249 alerts.

4. Ranking degrades before it breaks.

Even past the comfortable regime, recoverable causes stay near the top: at 5% noise — already very volatile for period-over-period segment data — 78% of injected causes ranked in the top 5 of a 12-item list.
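A back-of-envelope view of why a 5% cause becomes hard to distinguish at high noise, as finding 3 states. This is not the engine's significance test. It assumes equal-sized leaves with independent noise, so the random movement of a node that aggregates n leaves shrinks by a factor of √n; the `signal_to_noise` helper is hypothetical.

```python
import math


def signal_to_noise(effect, noise_sd, n_leaves):
    """Effect size divided by the noise in a node's aggregate change.

    Assumes n_leaves equal-sized leaves with independent noise of standard
    deviation noise_sd each, so the aggregate's relative noise is
    noise_sd / sqrt(n_leaves).
    """
    return effect / (noise_sd / math.sqrt(n_leaves))


EFFECT = 0.05  # the ~5% injected effect
# Node sizes implied by the 4 x 4 x 4 x 4 structure: a single
# subcategory-by-state leaf, one state across all products, one whole category.
for n_leaves in (1, 16, 64):
    row = [f"{signal_to_noise(EFFECT, noise, n_leaves):5.2f}"
           for noise in (0.005, 0.05, 0.15)]
    print(f"{n_leaves:3d} leaves:", "  ".join(row))
```

Under these assumptions, at 15% noise the ratio is about 0.33 for a single leaf, 1.33 for a 16-leaf node, and 2.67 for a 64-leaf node. At 0.5% noise even a single leaf has a ratio of 10. A ratio near or below 1 means the effect is no larger than the node's own random movement, and scanning many candidate nodes at once generally raises the bar further.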

What This Means For Your Data

The Variable That Matters Is Stability, Not Size

Performance is governed by the period-over-period stability of your segments — not company size or row count. Quarterly revenue for an established business, aggregated to segment level, typically behaves like the left side of the table; daily data for tiny, sparse segments behaves like the right. FactorPrism®'s pre-flight checks and significance gating are built around exactly this: report what the data can support, and stay quiet past that line.

Two structural properties hold at every noise level, because they're guaranteed by construction rather than estimated: the returned factors always reconcile exactly to the total change being explained, and every factor is located — attached to the level of the business where it acts, separating broad-based forces from problems localized to one segment.

Method Notes & Limitations

The Fine Print, Voluntarily

Synthetic by necessity: Ground truth doesn't exist otherwise. The generator is the same simulation framework we use for engine validation, not a showcase tuned for the engine: causes land at random hierarchy levels, in both directions, with random sizes.

Two-period comparison: The canonical variance question. Multi-period analyses give the significance gate strictly more evidence to work with.

Production code path: The engine run in the benchmark is the same code that ships in the Snowflake Native App, with the same defaults and no benchmark-specific configuration. Segments that launch from zero are handled by a separate mechanism (not exercised here).

Terminology: In this study, "cause" refers to the injected ground truth and "factor" to what the engine returns; the benchmark scores the second against the first. In looser technical prose the two words are often used interchangeably; in the product itself every reported item is a factor, located at the place in the business where it acts.

Run it on your own numbers

FactorPrism® is a Snowflake Native App: the analysis runs entirely inside your Snowflake account — your data never leaves. Install and run the built-in demo, no data connection required.
