Methodology · Scanner accuracy

How we verify the scanner — 195 automated tests.

Open methodology means nothing if the implementation drifts from the published formula. On every change we run an automated test suite designed to catch the four kinds of bugs that would actually hurt users: scoring math errors, monotonicity violations (worse posture scoring better), false-clean fallbacks (scanner failed → score looks fine), and regressions on bugs we've already fixed.

Tests in the suite: 195

Passing (last commit): 195 / 195

Test files: 12

Build status: CI green

What the suite covers

Category	What it catches
Score math closure	Total score equals the sum of weighted component contributions; the weighting table itself sums to exactly 1.0; component contributions exactly equal weight × raw, rounded. Catches drift in the weighting table.
Monotonicity invariants (hand-picked + ~1,400 property-based)	If posture B is worse than posture A on at least one dimension and no better on any other, then score(B) ≥ score(A); higher scores mean higher risk. Hand-picked pairs cover known transitions; `fast-check` sweeps ~1,400 random combinations to find the cases we didn't think of.
Ground-truth canonical profiles	Hardcoded archetypes (modern PQC + clean keys + small surface / average enterprise / legacy RSA disaster) must score within plausible ranges (modern ≤ 3.5, disaster ≥ 7.5) and cross-profile ordering must hold (modern < average < disaster).
Regression pins for past bugs	Every bug we've ever found gets an explicit test that fails if the bug ever comes back. See the scoring changelog below for the bugs currently pinned.
Semantic contradiction tests	Outputs that should never co-occur. Example: if the RSA-fallback finding fires, the `keyExchange` component score must be in the high band. An expired cert must produce a high `certLifetime` raw. A 1-cert observation must NOT use "best-practice" rationale wording.
Weight dominance	Verifies the 50% weight on `keyExchange` actually dominates — perfect cert hygiene cannot hide catastrophic key exchange. Catches "score collapses to a single component" bugs.
Grade boundary sweeps	Every score → grade cutoff tested at the boundary. Catches off-by-one inclusive-vs-exclusive bugs in the grade mapping.
CT-failure semantic safety	When CT log enumeration fails, the scanner must NOT score that as a tiny attack surface. The fallback path uses honest "subdomain enumeration unavailable" rationale text and a neutral mid-band score — never raw=1.
SSRF defensive layer	Domain validation rejects IPv4 strings at the input layer; post-resolution check rejects hostnames that resolve to private (RFC 1918), loopback, link-local (incl. cloud metadata at `169.254.169.254`), CGNAT, or multicast addresses. 51 dedicated tests covering every reserved range.
Domain normalization	Strips `https://`, `www.`, path components, querystrings. Rejects IPs, bare TLDs, consecutive dots, unsafe characters, and label-length violations. SSRF-shaped inputs handled at the string layer before resolution.
Finding registry uniqueness + completeness	Every finding has a stable string ID that joins to the dispute table. Duplicate IDs would corrupt joins silently. Tests assert every rule has id + severity + title + area + `scoreImpact` — and that the area-to-impact mapping is consistent (TLS / cert / key-reuse findings are "weighted"; HTTP header / email / supply-chain findings are "advisory").
Observation history rollup math	The cumulative subdomain / script-host time series must be monotonically non-decreasing. `trackedSince` is the earliest `first_observed` across all observation tables. Distinct SPKI counting ignores nulls and duplicates.
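
As a rough illustration of the first two rows, here is a minimal TypeScript sketch of a closure check and a seeded random monotonicity sweep. It is hand-rolled rather than built on `fast-check`, and everything except the 50% keyExchange and 10% certLifetime weights is an assumption: the other two weights, the 1 to 10 raw scale, and all function names are hypothetical.

```typescript
type Component = "keyExchange" | "keyPersistence" | "subdomainScale" | "certLifetime";
type Posture = Record<Component, number>; // raw risk per component, assumed 1..10

// Integer percentages keep the closure check exact (no floating-point sum).
const WEIGHTS_PCT: Record<Component, number> = {
  keyExchange: 50,    // from this page
  keyPersistence: 25, // assumed for illustration
  subdomainScale: 15, // assumed for illustration
  certLifetime: 10,   // from this page
};
const COMPONENTS = Object.keys(WEIGHTS_PCT) as Component[];

// Closure: the weighting table must account for exactly 100% of the score.
function weightsSumToOne(): boolean {
  return COMPONENTS.reduce((sum, c) => sum + WEIGHTS_PCT[c], 0) === 100;
}

// Total score is the sum of weight × raw over all components.
function scoreOf(p: Posture): number {
  return COMPONENTS.reduce((sum, c) => sum + WEIGHTS_PCT[c] * p[c], 0) / 100;
}

// Small seeded PRNG (mulberry32-style) so every sweep is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// For each random posture A, worsen one component at a time to get B
// and count the pairs where B scores lower (better) than A.
function monotonicityViolations(postures: number, seed = 1): number {
  const rand = seededRandom(seed);
  const randRaw = () => 1 + Math.floor(rand() * 10); // 1..10
  let violations = 0;
  for (let i = 0; i < postures; i++) {
    const a: Posture = {
      keyExchange: randRaw(),
      keyPersistence: randRaw(),
      subdomainScale: randRaw(),
      certLifetime: randRaw(),
    };
    for (const c of COMPONENTS) {
      const b: Posture = { ...a };
      b[c] = Math.min(10, a[c] + 1 + Math.floor(rand() * 9));
      if (scoreOf(b) < scoreOf(a)) violations++;
    }
  }
  return violations;
}
```

Calling `monotonicityViolations(350)` checks 1,400 pairs. With a purely linear formula the property holds by construction; a sweep like this becomes useful once caps, floors, or cross-component overrides enter the scoring path.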

Example scenarios — what the suite verifies on real archetypes

Example 1 — Modern ephemeral-only TLS, hybrid PQC, clean key rotation, small surface.
Expected: score ≤ 3.5, grade A or B. The suite asserts this archetype lands in the low-risk band on every commit. Catches refactors that accidentally over-penalize healthy domains.

Example 2 — RSA fallback accepted (downgrade-attackable) on otherwise-clean posture.
Expected: keyExchange raw ≥ 5 with the tls.rsa_kex_fallback finding firing. Hybrid PQC support does NOT override this: PQC is not a get-out-of-jail-free card if RSA is still accepted in the fallback path. The "Forward Secrecy Paradox" test verifies that this posture scores worse (higher) than ECDHE-only with a long-lived cert, because the 50% weight on keyExchange dominates the 10% weight on certLifetime.
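
The weight arithmetic behind that comparison can be worked through in a few lines. This is a sketch only: the two weights come from this page, while the raw values (5 for RSA fallback, 2 for ECDHE-only, 1 and 10 for the best and worst cert lifetime) and the `partialScore` helper are illustrative assumptions.

```typescript
const KEY_EXCHANGE_WEIGHT_PCT = 50;  // from this page
const CERT_LIFETIME_WEIGHT_PCT = 10; // from this page

// Score contribution of just these two components, in points.
function partialScore(keyExchangeRaw: number, certLifetimeRaw: number): number {
  return (
    (KEY_EXCHANGE_WEIGHT_PCT * keyExchangeRaw +
      CERT_LIFETIME_WEIGHT_PCT * certLifetimeRaw) / 100
  );
}

// RSA fallback accepted, best possible cert lifetime: 0.5 * 5 + 0.1 * 1
const rsaFallbackCleanCert = partialScore(5, 1); // 2.6

// ECDHE-only (assumed raw 2), worst possible cert lifetime: 0.5 * 2 + 0.1 * 10
const ecdheOnlyLongCert = partialScore(2, 10); // 2.0
```

Under these assumed raws, even the worst certLifetime value cannot lift the ECDHE-only posture above the RSA-fallback one, which is the ordering the test pins.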

Example 3 — Single observed cert, no rotation history.
Expected: keyPersistence returns a neutral raw=3 with "insufficient CT-log history" rationale — NOT a low-risk raw=1 with "best practice" wording. Sparse data must look sparse, not pristine.
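
A minimal sketch of that guard, assuming the component receives one SPKI fingerprint per observed cert. The neutral raw of 3 and the two-observation minimum come from this page; the input shape, the non-neutral raw values, and their rationale strings are placeholders, since the real thresholds are not published.

```typescript
interface ComponentResult {
  raw: number;
  rationale: string;
}

const MIN_CERT_OBSERVATIONS = 2; // minimum stated in the scoring changelog
const NEUTRAL_RAW = 3;           // neutral value named in Example 3

// spkiHashes: one SPKI fingerprint per observed cert (hypothetical input shape).
function keyPersistence(spkiHashes: string[]): ComponentResult {
  // Sparse data must look sparse: no favorable score without rotation evidence.
  if (spkiHashes.length < MIN_CERT_OBSERVATIONS) {
    return { raw: NEUTRAL_RAW, rationale: "insufficient CT-log history" };
  }
  const distinctKeys = new Set(spkiHashes).size;
  const rotatedEveryRenewal = distinctKeys === spkiHashes.length;
  // Placeholder mapping; the published page does not give these values.
  return rotatedEveryRenewal
    ? { raw: 1, rationale: "fresh key per cert renewal" }
    : { raw: 7, rationale: "key reused across cert renewals" };
}
```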

Example 4 — CT log enumeration times out / fails.
Expected: subdomainScale falls back to a neutral mid-band raw with honest rationale ("subdomain enumeration unavailable"). The wildcard signal still comes through via the TLS handshake's leaf cert. Failure does NOT collapse to a low-risk score.
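
One way to make the failure path hard to confuse with a small surface is to model the CT result as a tagged union, so the scorer has to handle the failed case explicitly. The sketch below assumes a neutral raw of 5 and invented count bands; only the rationale text and the leaf-cert wildcard behaviour are taken from this page.

```typescript
type CtEnumeration =
  | { kind: "ok"; subdomains: string[] }
  | { kind: "failed"; reason: string };

interface SubdomainScaleResult {
  raw: number;
  rationale: string;
  wildcard: boolean;
}

const NEUTRAL_MID_BAND_RAW = 5; // assumed; the page says only "neutral mid-band"

function subdomainScale(ct: CtEnumeration, leafCertSans: string[]): SubdomainScaleResult {
  // The wildcard signal comes from the handshake's leaf cert, so it survives CT failure.
  const wildcard = leafCertSans.some((san) => san.startsWith("*."));
  if (ct.kind === "failed") {
    return {
      raw: NEUTRAL_MID_BAND_RAW,
      rationale: "subdomain enumeration unavailable",
      wildcard,
    };
  }
  const count = new Set(ct.subdomains).size;
  // Placeholder bands; the real thresholds are deliberately unpublished.
  const raw = count <= 10 ? 1 : count <= 100 ? 4 : 8;
  return { raw, rationale: `${count} subdomains observed in CT logs`, wildcard };
}
```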

Example 5 — Legacy RSA-only, TLS 1.0, expired cert, 5-year key reuse, 250-subdomain wildcard.
Expected: score ≥ 7.5, grade D or F. This is the canonical "disaster" archetype; the suite asserts that the formula meaningfully separates it from healthy domains, with a gap of at least 4.0 points from the modern archetype above.
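
The archetype bounds quoted in these examples can be collected into a single check. This sketch takes already-computed scores as input; the function name and return shape are hypothetical, and the numbers are the ones stated on this page.

```typescript
interface CanonicalScores {
  modern: number;
  average: number;
  disaster: number;
}

// Returns human-readable violations; an empty array means every bound holds.
function checkCanonicalProfiles(s: CanonicalScores): string[] {
  const violations: string[] = [];
  if (s.modern > 3.5) violations.push("modern archetype above 3.5");
  if (s.disaster < 7.5) violations.push("disaster archetype below 7.5");
  if (!(s.modern < s.average && s.average < s.disaster)) {
    violations.push("cross-profile ordering broken");
  }
  if (s.disaster - s.modern < 4.0) violations.push("modern/disaster gap below 4.0");
  return violations;
}
```

The 4.0-point gap follows automatically whenever both range bounds hold (7.5 minus 3.5), so keeping it as its own assertion mainly protects against someone loosening one of the bounds later.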

Scoring changelog — bugs we've found and fixed

Trust comes from owning the misses, not pretending the scanner was perfect from day one. Every bug listed here has a regression test pinned in the suite, so it can't silently come back.

May 2026 — Hybrid PQC no longer masks RSA fallback

Bug found 2026-05-12: a domain advertising hybrid PQC (X25519MLKEM768) but also accepting RSA fallback in its cipher list was being scored as if PQC alone were sufficient. Fixed: RSA fallback acceptance now produces a downgrade-attackable penalty independent of PQC status. PQC is not a get-out-of-jail-free card.
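
The shape of that fix can be sketched as a penalty applied after the base score rather than inside the PQC branch. The finding ID and the floor of 5 appear on this page; the observation fields and base raw values are illustrative assumptions, not the scanner's actual mapping.

```typescript
interface KexObservation {
  hybridPqc: boolean;
  ecdhe: boolean;
  rsaFallbackAccepted: boolean;
}

interface KexResult {
  raw: number;
  findings: string[];
}

const HIGH_BAND_FLOOR = 5; // Example 2 expects raw >= 5 when fallback is accepted

function keyExchange(obs: KexObservation): KexResult {
  // Base raw from the best advertised key exchange (placeholder values).
  let raw = obs.hybridPqc ? 1 : obs.ecdhe ? 2 : 9;
  const findings: string[] = [];
  // Applied after the base score so PQC support cannot mask the downgrade path.
  if (obs.rsaFallbackAccepted) {
    raw = Math.max(raw, HIGH_BAND_FLOOR);
    findings.push("tls.rsa_kex_fallback");
  }
  return { raw, findings };
}
```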

May 2026 — Sparse cert history no longer reads as "best practice"

Bug found 2026-05-12: a domain with only one cert observed in CT logs was being scored as if it had a "fresh key per cert renewal" — when in fact we had no rotation evidence at all. Fixed: keyPersistence now requires at least 2 cert observations before any favorable score; below that, it returns a neutral mid-band raw with explicit "insufficient history" rationale.

May 2026 — Subdomain attack-surface input validation tightened

Bug found 2026-05-12: isValidDomain('192.168.1.1') was returning true at the input layer, allowing IPv4-shaped strings into the probe pipeline. Fixed: explicit IPv4 rejection + per-label length cap (RFC 1035) + unsafe-character filter. SSRF-shaped strings are now blocked before any DNS resolution happens.
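
A string-layer validator along those lines might look like the sketch below. Only the name `isValidDomain` and the listed rejection rules come from this page; `normalizeDomain`, the regular expressions, and the 253-character total cap are assumptions for illustration, and the sketch does not claim to cover every SSRF-shaped input.

```typescript
const IPV4_SHAPED = /^\d{1,3}(\.\d{1,3}){3}$/;
// Letters, digits, and inner hyphens only; anything else counts as unsafe.
const SAFE_LABEL = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/;
const MAX_LABEL_LENGTH = 63; // RFC 1035 per-label limit

// Strip scheme, path, query, fragment, and a leading "www." (hypothetical helper).
function normalizeDomain(input: string): string {
  return input
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, "")
    .replace(/[\/?#].*$/, "")
    .replace(/^www\./, "");
}

function isValidDomain(domain: string): boolean {
  if (domain.length === 0 || domain.length > 253) return false;
  // Explicit IPv4 rejection: each octet would otherwise pass as a valid label.
  if (IPV4_SHAPED.test(domain)) return false;
  const labels = domain.split(".");
  if (labels.length < 2) return false; // bare TLD
  // An empty label means consecutive, leading, or trailing dots.
  return labels.every(
    (label) =>
      label.length >= 1 &&
      label.length <= MAX_LABEL_LENGTH &&
      SAFE_LABEL.test(label)
  );
}
```

For example, `isValidDomain(normalizeDomain("https://www.example.com/path?q=1"))` accepts the input as `example.com`, while `isValidDomain("192.168.1.1")` is rejected before any DNS lookup.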

May 2026 — Post-resolution SSRF defensive layer added

Defensive gap closed 2026-05-12: previously, a legitimate-looking hostname resolving to a private/reserved IP (e.g., evil.example.com → 127.0.0.1 or → 169.254.169.254, the cloud metadata endpoint) would pass the input-string filter and be probed. Fixed: post-resolution check rejects any hostname whose A/AAAA records point into RFC 1918 private, loopback, link-local, CGNAT, or multicast ranges. 51 dedicated tests cover every reserved range.
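
The post-resolution check reduces to CIDR membership tests on each resolved address. Below is an IPv4-only sketch covering the ranges named in this entry; the function names are hypothetical, DNS resolution itself is left out, and anything the parser cannot read (including IPv6) is treated as unsafe rather than allowed through.

```typescript
// [network, prefix length, label] for the ranges named above (IPv4 only).
const RESERVED_V4: Array<[string, number, string]> = [
  ["10.0.0.0", 8, "RFC 1918 private"],
  ["172.16.0.0", 12, "RFC 1918 private"],
  ["192.168.0.0", 16, "RFC 1918 private"],
  ["127.0.0.0", 8, "loopback"],
  ["169.254.0.0", 16, "link-local"], // includes 169.254.169.254 (cloud metadata)
  ["100.64.0.0", 10, "CGNAT"],
  ["224.0.0.0", 4, "multicast"],
];

// Parse a dotted quad into an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// Returns the matching range label, or null if the address is publicly routable.
function reservedRangeOf(ip: string): string | null {
  const addr = ipv4ToInt(ip);
  if (addr === null) return "unparseable"; // fail closed
  for (const [network, prefix, label] of RESERVED_V4) {
    const base = ipv4ToInt(network);
    if (base === null) continue;
    const blockSize = Math.pow(2, 32 - prefix);
    if (Math.floor(addr / blockSize) === Math.floor(base / blockSize)) return label;
  }
  return null;
}

// A hostname is safe to probe only if every resolved address is non-reserved.
function isSafeToProbe(resolvedAddresses: string[]): boolean {
  return (
    resolvedAddresses.length > 0 &&
    resolvedAddresses.every((ip) => reservedRangeOf(ip) === null)
  );
}
```

Checking every address rather than only the first matters here: a hostname with one public and one loopback record would otherwise pass on a lucky ordering.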

What we deliberately don't publish

This page describes the categories of checks the suite runs. We don't publish the exact threshold values (e.g., "what subdomain count triggers the wildcard escalation"), the full assertion bodies, or the canary domains we use for live monitoring. Publishing those would let a competitor reverse-engineer the formula by tuning their scanner until its output matched ours, without ever earning the trust signal of an open methodology.

What IS public, and remains public:

The full formula and weights (in /methodology/score-components)
The categories of tests run on every commit (this page)
The bug-fix changelog (this page, above)
The CLI source and tests in the public cipherwakelabs/pqcheck repo

This page is updated whenever the test count changes meaningfully or a new bug is fixed. Last update: 2026-05-12 · 195 tests · 12 test files · all passing. The bigger goal is alignment with how SSL Labs and Mozilla Observatory operate: methodology is open, implementation drift is publicly tracked, and missed cases get owned, not buried.