Open methodology means nothing if the implementation drifts from the published formula. Every change runs through an automated test suite that targets the four kinds of bugs that would actually hurt users: scoring math errors, monotonicity violations (a worse posture earning a better score), false-clean fallbacks (the scanner failed, yet the score looks fine), and regressions of bugs we've already fixed.
| Category | What it catches |
|---|---|
| Score math closure | The total score equals the sum of the weighted component contributions; the weighting table itself sums to exactly 1.0; each component contribution equals weight × raw after rounding. Catches drift in the weighting table. |
| Monotonicity invariants (hand-picked + 1,400 property-based) | Higher scores mean higher risk, so if posture B is worse than posture A on at least one dimension and no better on any other, then score(B) ≥ score(A). Hand-picked pairs cover known transitions; fast-check sweeps ~1,400 random combinations to find the cases we didn't think of. |
| Ground-truth canonical profiles | Hardcoded archetypes (modern PQC + clean keys + small surface / average enterprise / legacy RSA disaster) must score within plausible ranges (modern ≤ 3.5, disaster ≥ 7.5) and cross-profile ordering must hold (modern < average < disaster). |
| Regression pins for past bugs | Every bug we've ever found gets an explicit test that fails if the bug ever comes back. See the scoring changelog below for the bugs currently pinned. |
| Semantic contradiction tests | Outputs that should never co-occur. Example: if the RSA-fallback finding fires, the keyExchange component score must be in the high band. An expired cert must produce a high certLifetime raw. A 1-cert observation must NOT use "best-practice" rationale wording. |
| Weight dominance | Verifies the 50% weight on keyExchange actually dominates — perfect cert hygiene cannot hide catastrophic key exchange. Catches "score collapses to a single component" bugs. |
| Grade boundary sweeps | Every score → grade cutoff tested at the boundary. Catches off-by-one inclusive-vs-exclusive bugs in the grade mapping. |
| CT-failure semantic safety | When CT log enumeration fails, the scanner must NOT score that as a tiny attack surface. The fallback path uses honest "subdomain enumeration unavailable" rationale text and a neutral mid-band score — never raw=1. |
| SSRF defensive layer | Domain validation rejects IPv4 strings at the input layer; post-resolution check rejects hostnames that resolve to private (RFC 1918), loopback, link-local (incl. cloud metadata at 169.254.169.254), CGNAT, or multicast addresses. 51 dedicated tests covering every reserved range. |
| Domain normalization | Strips https://, www., path components, querystrings. Rejects IPs, bare TLDs, consecutive dots, unsafe characters, and label-length violations. SSRF-shaped inputs handled at the string layer before resolution. |
| Finding registry uniqueness + completeness | Every finding has a stable string ID that joins to the dispute table. Duplicate IDs would corrupt joins silently. Tests assert every rule has id + severity + title + area + scoreImpact — and that the area-to-impact mapping is consistent (TLS / cert / key-reuse findings are "weighted"; HTTP header / email / supply-chain findings are "advisory"). |
| Observation history rollup math | The cumulative subdomain / script-host time series must be monotonically non-decreasing. trackedSince is the earliest first_observed across all observation tables. Distinct SPKI counting ignores nulls and duplicates. |
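The closure, weight-dominance, and monotonicity rows can be illustrated with a minimal sketch. Assumptions: only the 50% keyExchange and 10% certLifetime weights come from this page; the other two weights, the integer 1 to 10 raw scale, and every function name are hypothetical. Weights are stored as integer percentages so that the "sums to exactly 1.0" check is an exact integer comparison rather than a floating-point one. The toy input space is small enough to enumerate exhaustively; a property-based tool such as fast-check samples a larger space at random instead.

```typescript
// Hypothetical weight table in integer percent. Only keyExchange (50) and
// certLifetime (10) are stated on this page; the other two values are assumed.
const WEIGHTS_PCT = {
  keyExchange: 50,
  certLifetime: 10,
  keyPersistence: 20,
  subdomainScale: 20,
} as const;

type Component = keyof typeof WEIGHTS_PCT;
// Raw risk per component; an integer 1..10 scale is assumed (higher = worse).
type Posture = Record<Component, number>;

const COMPONENTS = Object.keys(WEIGHTS_PCT) as Component[];

// Closure check: the weighting table sums to exactly 100%.
function weightsSumToOne(): boolean {
  return COMPONENTS.reduce((sum, c) => sum + WEIGHTS_PCT[c], 0) === 100;
}

// Contribution in hundredths of a point: percent × integer raw is exact.
function contributionHundredths(p: Posture, c: Component): number {
  return WEIGHTS_PCT[c] * p[c];
}

// The total is the sum of the per-component contributions, nothing else.
function totalScore(p: Posture): number {
  const hundredths = COMPONENTS.reduce(
    (sum, c) => sum + contributionHundredths(p, c),
    0,
  );
  return hundredths / 100;
}

// Monotonicity: worsening any one component by one step must never lower
// the total. Returns the number of violating (A, B) pairs found.
function countMonotonicityViolations(): number {
  let violations = 0;
  const combos = Math.pow(10, COMPONENTS.length);
  for (let n = 0; n < combos; n++) {
    // Decode n into one raw value (1..10) per component.
    const a = {} as Posture;
    let rest = n;
    for (const c of COMPONENTS) {
      a[c] = 1 + (rest % 10);
      rest = Math.floor(rest / 10);
    }
    for (const c of COMPONENTS) {
      if (a[c] === 10) continue; // already at the worst value
      const b: Posture = { ...a };
      b[c] = a[c] + 1;
      if (totalScore(b) < totalScore(a)) violations++;
    }
  }
  return violations;
}
```

With these assumed weights, a posture with catastrophic key exchange (raw 10) and otherwise perfect components (raw 1) totals 5.5, against 1.0 for an all-perfect posture, which is the kind of gap the weight-dominance row is meant to protect.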
When a server still accepts RSA key exchange as a fallback, keyExchange must score raw ≥ 5 and the tls.rsa_kex_fallback finding must fire. Hybrid PQC support does NOT override this: PQC is not a get-out-of-jail-free card if RSA is still accepted in the fallback path. The "Forward Secrecy Paradox" test verifies that this posture scores worse than an ECDHE-only server with a long-lived cert, because the 50% weight on keyExchange dominates the 10% weight on certLifetime.
When only one cert has been observed in CT logs, keyPersistence returns a neutral raw=3 with an "insufficient CT-log history" rationale, NOT a low-risk raw=1 with "best practice" wording. Sparse data must look sparse, not pristine.
When CT log enumeration fails, subdomainScale falls back to a neutral mid-band raw with an honest rationale ("subdomain enumeration unavailable"). The wildcard signal still comes through via the TLS handshake's leaf cert. Failure does NOT collapse to a low-risk score.
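The subdomainScale fallback can be sketched as a scorer that treats "enumeration failed" and "enumeration found nothing" as different inputs. Everything below is an assumption for illustration: the type and function names, the neutral raw of 5, and the size threshold are hypothetical, since the real thresholds are not published. Only the rationale string is taken from this page.

```typescript
interface ComponentScore {
  raw: number;
  rationale: string;
}

// Enumeration either succeeded (possibly with zero results) or failed outright.
type CtEnumeration =
  | { kind: "ok"; subdomains: string[] }
  | { kind: "failed"; error: string };

// Hypothetical values; the real thresholds are not published.
const NEUTRAL_MID_BAND_RAW = 5;
const LARGE_SURFACE_THRESHOLD = 50;

function scoreSubdomainScale(
  ct: CtEnumeration,
  leafCertHasWildcard: boolean,
): ComponentScore {
  // The wildcard signal comes from the TLS handshake, so it survives a CT failure.
  const wildcardNote = leafCertHasWildcard ? "; wildcard seen on leaf cert" : "";

  if (ct.kind === "failed") {
    // Failure means "unknown", not "small": never reach the low-risk branch.
    return {
      raw: NEUTRAL_MID_BAND_RAW,
      rationale: "subdomain enumeration unavailable" + wildcardNote,
    };
  }

  const count = new Set(ct.subdomains).size;
  const raw = count >= LARGE_SURFACE_THRESHOLD ? 8 : count > 0 ? 3 : 1;
  return { raw, rationale: `${count} subdomains observed in CT logs` + wildcardNote };
}
```

Modelling the result as a tagged union makes the false-clean path hard to write by accident: an empty `subdomains` array is only reachable on the `"ok"` branch.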
Trust comes from owning the misses, not pretending the scanner was perfect from day one. Every bug listed here has a regression test pinned in the suite, so it can't silently come back.
Bug found 2026-05-12: a domain advertising hybrid PQC (X25519MLKEM768) but also accepting RSA fallback in its cipher list was being scored as if PQC alone were sufficient. Fixed: accepting RSA fallback now incurs a downgrade-attack penalty regardless of PQC status. PQC is not a get-out-of-jail-free card.
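A minimal sketch of the RSA-fallback fix and the "Forward Secrecy Paradox" arithmetic follows. The floor of 5, the finding ID, and the 50% and 10% weights come from this page; the base raw values (1, 2, 9), the cert-lifetime raws, and all type and function names are hypothetical.

```typescript
interface KexObservation {
  hybridPqc: boolean; // e.g. X25519MLKEM768 is offered
  ecdhe: boolean;
  acceptsRsaKex: boolean; // RSA key exchange still accepted as a fallback
}

interface KexScore {
  raw: number;
  findings: string[];
}

// From this page: RSA fallback forces keyExchange raw to at least 5.
const RSA_FALLBACK_FLOOR = 5;

function scoreKeyExchange(obs: KexObservation): KexScore {
  // Hypothetical base raws: hybrid PQC best, ECDHE next, anything else poor.
  let raw = obs.hybridPqc ? 1 : obs.ecdhe ? 2 : 9;
  const findings: string[] = [];

  if (obs.acceptsRsaKex) {
    // The penalty applies independently of PQC status.
    raw = Math.max(raw, RSA_FALLBACK_FLOOR);
    findings.push("tls.rsa_kex_fallback");
  }
  return { raw, findings };
}

// Combined contribution of keyExchange (50%) and certLifetime (10%),
// in hundredths of a point so the arithmetic stays in exact integers.
function partialHundredths(keyExchangeRaw: number, certLifetimeRaw: number): number {
  return 50 * keyExchangeRaw + 10 * certLifetimeRaw;
}
```

Under these assumptions, a PQC server with RSA fallback and a short-lived cert (raws 5 and 1) contributes 2.60 points, while an ECDHE-only server with a long-lived cert (raws 2 and 10) contributes 2.00, so the fallback posture scores worse, as the paradox test requires.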
Bug found 2026-05-12: a domain with only one cert observed in CT logs was being scored as if it had a "fresh key per cert renewal" — when in fact we had no rotation evidence at all. Fixed: keyPersistence now requires at least 2 cert observations before any favorable score; below that, it returns a neutral mid-band raw with explicit "insufficient history" rationale.
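The keyPersistence rule can be sketched as follows. The two-observation minimum, the neutral raw=3, and the "insufficient CT-log history" wording come from this page; the function names, the unfavorable raw of 7, and the choice to count only certs with a readable SPKI as observations are assumptions made for illustration.

```typescript
interface ComponentScore {
  raw: number;
  rationale: string;
}

// Distinct SPKI counting ignores nulls and duplicates.
function countDistinctSpki(spkis: Array<string | null>): number {
  const usable = spkis.filter((s): s is string => s !== null);
  return new Set(usable).size;
}

const MIN_OBSERVATIONS = 2; // from this page
const NEUTRAL_RAW = 3; // from this page

function scoreKeyPersistence(spkiPerCert: Array<string | null>): ComponentScore {
  // Assumption: a cert whose SPKI could not be read is not rotation evidence.
  const usable = spkiPerCert.filter((s): s is string => s !== null);

  if (usable.length < MIN_OBSERVATIONS) {
    // Sparse data must look sparse, not pristine.
    return { raw: NEUTRAL_RAW, rationale: "insufficient CT-log history" };
  }

  const distinct = countDistinctSpki(usable);
  if (distinct === usable.length) {
    return { raw: 1, rationale: "fresh key per cert renewal" };
  }
  // Hypothetical unfavorable raw for observed key reuse.
  return {
    raw: 7,
    rationale: `${usable.length - distinct} cert(s) reused an earlier key`,
  };
}
```

The favorable branch is only reachable once at least two keys have been compared, which is what the pinned regression test guards.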
Bug found 2026-05-12: isValidDomain('192.168.1.1') was returning true at the input layer, allowing IPv4-shaped strings into the probe pipeline. Fixed: explicit IPv4 rejection + per-label length cap (RFC 1035) + unsafe-character filter. SSRF-shaped strings are now blocked before any DNS resolution happens.
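A minimal sketch of string-layer validation along these lines is shown below. Only the name `isValidDomain` and the listed rules (IPv4 rejection, RFC 1035 label cap, unsafe characters, bare TLDs, consecutive dots) come from this page; `normalizeDomain`, the regular expressions, and the 253-character total cap are assumptions, and the real implementation may differ.

```typescript
// Hypothetical helper: strips scheme, path, querystring, fragment, and "www.".
function normalizeDomain(input: string): string {
  let s = input.trim().toLowerCase();
  s = s.replace(/^[a-z][a-z0-9+.-]*:\/\//, ""); // scheme such as https://
  s = s.split(/[/?#]/)[0] ?? ""; // drop path, querystring, fragment
  return s.replace(/^www\./, "");
}

// Four dot-separated groups of 1-3 digits: IPv4-shaped, never a domain.
const IPV4_SHAPED = /^\d{1,3}(\.\d{1,3}){3}$/;
// Letters, digits, and inner hyphens only; anything else counts as unsafe.
const SAFE_LABEL = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/;

const MAX_LABEL_LENGTH = 63; // RFC 1035 per-label cap
const MAX_NAME_LENGTH = 253; // assumed total-length cap

// Expects input that has already been through normalizeDomain.
function isValidDomain(domain: string): boolean {
  if (domain.length === 0 || domain.length > MAX_NAME_LENGTH) return false;
  if (IPV4_SHAPED.test(domain)) return false; // explicit IPv4 rejection

  const labels = domain.split(".");
  if (labels.length < 2) return false; // bare TLD such as "com"

  // An empty label means consecutive, leading, or trailing dots.
  return labels.every(
    (label) =>
      label.length >= 1 &&
      label.length <= MAX_LABEL_LENGTH &&
      SAFE_LABEL.test(label),
  );
}
```

The explicit IPv4 check matters because `192.168.1.1` is otherwise four perfectly legal labels; without it, the label rules alone would let the string through.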
Defensive gap closed 2026-05-12: previously, a legitimate-looking hostname resolving to a private/reserved IP (e.g., evil.example.com → 127.0.0.1 or → 169.254.169.254, the cloud metadata endpoint) would pass the input-string filter and be probed. Fixed: post-resolution check rejects any hostname whose A/AAAA records point into RFC 1918 private, loopback, link-local, CGNAT, or multicast ranges. 51 dedicated tests cover every reserved range.
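The post-resolution check can be sketched as a pure function over already-resolved addresses. The range list mirrors the categories named above; the function names are hypothetical, and the sketch is IPv4-only: it fails closed on anything it cannot parse, so a real check would also need the IPv6 equivalents and any further reserved ranges. Requiring every resolved address to pass is likewise an assumption.

```typescript
// Parses dotted-quad IPv4 into an unsigned integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let n = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    n = n * 256 + octet; // plain arithmetic avoids signed 32-bit bit-op pitfalls
  }
  return n;
}

// [network base, prefix length, label] for the categories named on this page.
const BLOCKED_V4: ReadonlyArray<[string, number, string]> = [
  ["10.0.0.0", 8, "RFC 1918 private"],
  ["172.16.0.0", 12, "RFC 1918 private"],
  ["192.168.0.0", 16, "RFC 1918 private"],
  ["127.0.0.0", 8, "loopback"],
  ["169.254.0.0", 16, "link-local"], // includes 169.254.169.254 (cloud metadata)
  ["100.64.0.0", 10, "CGNAT"],
  ["224.0.0.0", 4, "multicast"],
];

// Returns why an address is blocked, or null if it may be probed.
function blockedReason(ip: string): string | null {
  const n = ipv4ToInt(ip);
  if (n === null) return "unparseable address"; // fail closed
  for (const [base, prefixLen, label] of BLOCKED_V4) {
    const b = ipv4ToInt(base);
    if (b === null) continue;
    const blockSize = Math.pow(2, 32 - prefixLen);
    // Same network if both addresses fall in the same block of that size.
    if (Math.floor(n / blockSize) === Math.floor(b / blockSize)) return label;
  }
  return null;
}

// Assumption: every resolved record must be safe, and no records means no probe.
function isSafeToProbe(resolvedAddresses: string[]): boolean {
  return (
    resolvedAddresses.length > 0 &&
    resolvedAddresses.every((ip) => blockedReason(ip) === null)
  );
}
```

Checking every record rather than the first closes the case where a hostname returns one public address alongside one internal address.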
This page describes what the test suite checks, not the exact values it checks against. We don't publish the exact threshold values (e.g., "what subdomain count triggers the wildcard escalation"), the full assertion bodies, or the canary domains we use for live monitoring. Publishing those would let a competitor reverse-engineer the formula by tuning their scanner until its output matched ours, without ever earning the trust signal of an open methodology.
What IS public, and remains public:
This page is updated whenever the test count changes meaningfully or a new bug is fixed. Last update: 2026-05-12 · 195 tests · 12 test files · all passing. The bigger goal is alignment with how SSL Labs and Mozilla Observatory operate: methodology is open, implementation drift is publicly tracked, and missed cases get owned, not buried.