Measuring the shift

Anyone can prompt a model to play a CISO — conjure a hundred personas, ask if they’d buy, screenshot the percentage. It looks like proof, but nobody checked the number against a real one. The only part worth anything is measuring how far that answer sits from where a real CISO would land, then closing the gap on purpose. That’s how Príncipe is built, end to end.

Omer Grossman · Builder · Cybersecurity exec · ex-CyberArk, ex-IDF
Two decades in cybersecurity. Defended nations and organizations.

A synthetic panel is worth exactly what you’ve measured it against, and not one degree more. That’s the bar I hold every part of Príncipe to — and it’s the whole line between an instrument and a clever prompt.

We won’t tell you whether your idea is right. We’ll show you where the sky shifts — and where it doesn’t.

One prompt vs. a measured panel

Before the how, the what-you-get. Point the same security question at a single model told to “act as a CISO,” then at Príncipe. Five things come out different — and they’re the five that decide whether you can actually bet on the answer.

What you’re judging	Prompt one LLM as a CISO	Run it through Príncipe
The room	One averaged, agreeable voice.	30–200 CISOs built to disagree — you see where the room actually splits.
The number	Unchecked; nobody measured it against a real buyer.	Corrected against real CISO data, residual error reported — mean error fell 47 → 18 points.
Confidence	Equally certain about everything.	A sized confidence band, and a “directional” flag where it isn’t calibrated yet.
The answer	A vibe and a percentage.	A stance, the ranked objections that block it, and the most-opposed segment.
Trust	A black-box one-off you can’t inspect.	Open source and reproducible — every persona, correction, and statistic is checkable.

None of that is a cleverer prompt; it’s an instrument with an error model. The rest of this piece is how each of those rows actually gets built — typed, populated, calibrated, and kept honest.

The shift we started with

The naive version — one prompt, one panel, ask anything — we measured against public CISO surveys where the real answer is already known. It was off by a mean of ~47 percentage points, and the error wasn’t noise. It was framing-dependent: the same panel that over-rejected bold pitches would over-affirm “is this a priority?” and over-hedge a forecast. One prompt can’t correct three opposite biases at once. So we stopped asking the panel a question and started routing it through a pipeline where each stage has exactly one job.

The method, end to end

Type the question, correct its framing, answer it with a panel built to disagree, then calibrate — and hand back a confident number only when the data has earned a tight band. Until then: a directional read, wide band, objections first.

Router — classify before you answer

Every question is first typed: PITCH, STRATEGY, PRIORITY, FORECAST, or FACTUAL. Heuristics first, a small model call only when it’s genuinely ambiguous, and it never blocks the panel. Type is the lever everything downstream pulls — because the bias is type-specific.
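A minimal sketch of the "heuristics first, model only when ambiguous, never block" order described above. The keyword patterns, the `fallback_model` callable, and the choice of `PITCH` as the default are all assumptions made for illustration; the article does not publish the actual rules.

```python
import re
from enum import Enum
from typing import Callable, Optional


class QuestionType(Enum):
    PITCH = "PITCH"
    STRATEGY = "STRATEGY"
    PRIORITY = "PRIORITY"
    FORECAST = "FORECAST"
    FACTUAL = "FACTUAL"


# Hypothetical keyword heuristics. Order matters: the first match wins.
_HEURISTICS = [
    (QuestionType.FORECAST, re.compile(r"\b(will|by 20\d\d|by when|predict)\b", re.I)),
    (QuestionType.FACTUAL, re.compile(r"\bdo you (already |currently )?(use|have|run)\b", re.I)),
    (QuestionType.PRIORITY, re.compile(r"\b(priority|priorities|where to invest)\b", re.I)),
    (QuestionType.PITCH, re.compile(r"\bwould you (buy|adopt|pay|let|deploy)\b", re.I)),
    (QuestionType.STRATEGY, re.compile(r"\b(right approach|strategy)\b", re.I)),
]


def classify_by_heuristics(question: str) -> Optional[QuestionType]:
    """Return a type if any pattern matches, otherwise None (ambiguous)."""
    for qtype, pattern in _HEURISTICS:
        if pattern.search(question):
            return qtype
    return None


def route(
    question: str,
    fallback_model: Optional[Callable[[str], str]] = None,
    default: QuestionType = QuestionType.PITCH,
) -> QuestionType:
    """Heuristics first, then an optional model call, then a default."""
    qtype = classify_by_heuristics(question)
    if qtype is not None:
        return qtype
    if fallback_model is not None:
        try:
            return QuestionType(fallback_model(question))
        except Exception:
            pass  # a failed or malformed model answer must not block the panel
    return default
```

Because the fallback is wrapped and a default always exists, `route` returns a type for every input, which is what lets the panel run even when classification is uncertain.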

Type skill — correct the framing at the source

Each persona’s base prompt defines “agree” in pitch terms. For any other type that definition is wrong and quietly wins — it’s why a factual “do you use AI?” once came back 2% when 89% of real orgs do. So a non-pitch question first revokes the pitch framing and installs the right one.
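The revoke-then-install step can be pictured as prompt assembly. This is a sketch only: the framing sentences are paraphrased from the question-type table in this article, and `REVOKE_PITCH`, `FRAMINGS`, and `build_prompt` are hypothetical names, not Príncipe's actual prompts.

```python
# Hypothetical framing texts, one per non-pitch question type.
FRAMINGS = {
    "STRATEGY": 'Answer "in favour" only if you back this direction.',
    "PRIORITY": 'Answer "in favour" only if this beats your other demands.',
    "FORECAST": 'Answer "in favour" only if your best prediction is that it happens.',
    "FACTUAL": 'Answer "in favour" only if this is true of your organisation today.',
}

# Hypothetical instruction that cancels the pitch-style definition of "agree".
REVOKE_PITCH = (
    "Ignore any earlier definition of agreement in terms of buying or adopting a product."
)


def build_prompt(base_prompt: str, question_type: str) -> str:
    """Leave pitch prompts untouched; otherwise revoke and replace the framing."""
    if question_type == "PITCH":
        return base_prompt
    framing = FRAMINGS[question_type]  # raises KeyError for an unknown type
    return "\n\n".join([base_prompt, REVOKE_PITCH, framing])
```

Putting the revocation before the new framing mirrors the order in the text: the old definition is cancelled explicitly rather than left to compete with the new one.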

Review pass — stress-test the objections

Between the panel and the map, three reviewers from different seats interrogate the result: which objection actually blocks the deal, is the majority even defensible — and the part that earns its keep, what did the whole panel miss? It’s the peer-review round a real council does, and it’s where the “what the panel almost missed” line in the example below comes from.

Calibration map — correct the number, size the band honestly

A per-type correction learned from paired (panel, real) points, with a confidence band drawn from the residual. It’s gated: it only calls a type “calibrated” with enough data and a tight enough band. Otherwise the answer comes back directional — wide band, no false precision.
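To make the gating concrete, here is a minimal sketch that assumes the per-type correction is a straight-line fit and the band is 1.96 residual standard deviations. The article does not state the functional form or the thresholds, so `MIN_POINTS`, `MAX_HALF_WIDTH`, and the linear model are assumptions.

```python
import math
from dataclasses import dataclass

MIN_POINTS = 8          # hypothetical: minimum paired points before "calibrated"
MAX_HALF_WIDTH = 10.0   # hypothetical: widest acceptable half-band, in pp


@dataclass
class Reading:
    estimate: float
    low: float
    high: float
    label: str  # "calibrated" or "directional"


def fit_linear(pairs):
    """Least-squares fit of real = intercept + slope * panel, plus residual SD."""
    n = len(pairs)
    mean_x = sum(p for p, _ in pairs) / n
    mean_y = sum(r for _, r in pairs) / n
    sxx = sum((p - mean_x) ** 2 for p, _ in pairs)
    sxy = sum((p - mean_x) * (r - mean_y) for p, r in pairs)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    residuals = [r - (intercept + slope * p) for p, r in pairs]
    sd = math.sqrt(sum(e * e for e in residuals) / max(n - 2, 1))
    return slope, intercept, sd


def calibrate(panel_pct: float, pairs) -> Reading:
    """Correct a raw panel percentage using (panel, real) pairs for one type."""
    if len(pairs) < 3:
        # Too little data to fit anything: raw number, widest possible band.
        return Reading(panel_pct, 0.0, 100.0, "directional")
    slope, intercept, sd = fit_linear(pairs)
    estimate = min(100.0, max(0.0, intercept + slope * panel_pct))
    half = 1.96 * sd
    earned = len(pairs) >= MIN_POINTS and half <= MAX_HALF_WIDTH
    return Reading(
        estimate,
        max(0.0, estimate - half),
        min(100.0, estimate + half),
        "calibrated" if earned else "directional",
    )
```

The label depends on both conditions at once, so a type with plenty of data but noisy residuals still comes back directional.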

A standing rule: every number is computed server-side from the actual votes; the model only writes prose. A stance label can never quietly contradict the percentage beside it.
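One way to enforce that rule is to derive both the percentage and the stance label from the same vote tally in code, so the prose layer never computes either. The thresholds in `stance_label` are assumptions chosen for illustration; only the vote categories come from the example later in this article.

```python
from collections import Counter


def tally(votes):
    """Count pro / neutral / con votes and compute the percentage in favour."""
    if not votes:
        raise ValueError("no votes to tally")
    counts = Counter(votes)
    pct_in_favour = round(100 * counts["pro"] / len(votes))
    return counts["pro"], counts["neutral"], counts["con"], pct_in_favour


def stance_label(pct_in_favour: int) -> str:
    """Map the computed percentage to a label (hypothetical cut-offs)."""
    if pct_in_favour >= 60:
        return "Lean Yes"
    if pct_in_favour <= 40:
        return "Lean No"
    return "Split"


votes = ["pro"] * 19 + ["neutral"] * 9 + ["con"] * 22
pro, neutral, con, pct = tally(votes)
label = stance_label(pct)  # derived from pct, so it cannot contradict it
```

With the 19 / 9 / 22 split from the worked example, this yields 38% in favour and a "Lean No" label.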

Who’s in the room

The panel is variable-N, 30 to 200. Thirty is a hard floor — below it the result is statistically meaningless and the product refuses to pretend otherwise. Each synthetic CISO is assembled deterministically, so a composition is perfectly reproducible — table stakes for calibrating anything.

Identity — who they are

Region

US, EU-West, UK, EU-Central, APAC, ANZ, MEA — weighted to a realistic global mix, re-weightable per study.

Industry

24 buyer segments (GICS-derived, split for security: fintech vs banks, B2B SaaS vs consumer, gov/edu, healthcare, OT-heavy verticals).

Company size & tenure

150 employees to 20k+, budgets that scale with it; 3 to 15+ years in the chair.

Background

ex-engineer, ex-Big-4, ex-regulator, ex-military/intel, ex-founder, ex-pentester, or career CISO.

Disposition — what makes the room disagree

Identity alone produced a monolith: a uniformly skeptical panel that collapsed to 0% or 100% where real CISOs split down the middle. Real security leaders live in a tension (enable the business and defend it), and they resolve it differently. We model that on four independent axes.

Stance · cautious / balanced / aggressive / contrarian

How hard they interrogate the evidence.

Posture · enablement-first / pragmatic / security-purist

Their security-vs-business worldview, and how confident and resourced their org is.

AI posture · forward / pragmatic / skeptic

How far they trust AI to act on its own — the sharp edge of every AI-security pitch.

Mandate · reactive / embedded / strategic

Their organisational authority and program maturity — a reactive bolt-on with little leverage, or a board-level shaper who can drive change org-wide. It’s the difference between answering “could I execute this?” and “would I commit to it?” — and it’s why the same question splits a real room.
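Deterministic assembly from the identity and disposition dimensions above could look roughly like the following. It is a sketch under stated assumptions: the region weights are invented, the dispositions are drawn uniformly, and industry, company size, and tenure are left out because their full value lists are not given here.

```python
import random
from dataclasses import dataclass

REGIONS = ["US", "EU-West", "UK", "EU-Central", "APAC", "ANZ", "MEA"]
REGION_WEIGHTS = [35, 15, 10, 10, 15, 5, 10]  # hypothetical global mix
BACKGROUNDS = ["ex-engineer", "ex-Big-4", "ex-regulator", "ex-military/intel",
               "ex-founder", "ex-pentester", "career CISO"]
STANCES = ["cautious", "balanced", "aggressive", "contrarian"]
POSTURES = ["enablement-first", "pragmatic", "security-purist"]
AI_POSTURES = ["forward", "pragmatic", "skeptic"]
MANDATES = ["reactive", "embedded", "strategic"]

MIN_PANEL, MAX_PANEL = 30, 200  # panel size limits stated in the article


@dataclass(frozen=True)
class Persona:
    region: str
    background: str
    stance: str
    posture: str
    ai_posture: str
    mandate: str


def build_panel(n: int, seed: int) -> list:
    """Build n personas from a seeded generator so the panel is reproducible."""
    if not MIN_PANEL <= n <= MAX_PANEL:
        raise ValueError(f"panel size must be between {MIN_PANEL} and {MAX_PANEL}")
    rng = random.Random(seed)  # private generator: no dependence on global state
    return [
        Persona(
            region=rng.choices(REGIONS, weights=REGION_WEIGHTS, k=1)[0],
            background=rng.choice(BACKGROUNDS),
            stance=rng.choice(STANCES),
            posture=rng.choice(POSTURES),
            ai_posture=rng.choice(AI_POSTURES),
            mandate=rng.choice(MANDATES),
        )
        for _ in range(n)
    ]
```

The same `(n, seed)` pair always returns the same panel, which is the property calibration depends on: a correction fitted to one composition can be re-run against exactly that composition.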

Each persona is grounded in the same 2026 reality — the funding flood, AI as the stated top priority, identity as the dominant attack surface, tightening budgets — as context to react to from their own seat, never a consensus to recite.

And that reality isn’t frozen. A signed knowledge feed updates the panel every day — the latest breaches and incidents, new regulation and guidance, the week’s vendor and threat movements — distilled, never pasted, and tagged by region, industry, and category so it lands on the personas it actually bears on. That tagging is the same ontology: the dimensions that decide who’s in the room also decide what each of them read this morning. A healthcare CISO in the EU reacts to the breach and the directive that touch their seat; an APAC fintech CISO reacts to theirs. The sky you’re measuring against is today’s, not last year’s — which is the only way a synthetic answer stays grounded as the real one moves.
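The tag-based delivery described above amounts to filtering feed items by the same dimensions that define a persona. A small sketch, assuming each item carries sets of region and industry tags and that an empty set means "applies to everyone"; the `FeedItem` shape is hypothetical, and signature verification of the feed is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedItem:
    summary: str
    regions: frozenset      # empty set: relevant to every region (assumption)
    industries: frozenset   # empty set: relevant to every industry (assumption)
    category: str


def relevant_items(region: str, industry: str, feed):
    """Return the feed items whose tags bear on this persona's seat."""
    return [
        item for item in feed
        if (not item.regions or region in item.regions)
        and (not item.industries or industry in item.industries)
    ]


feed = [
    FeedItem("EU health directive update", frozenset({"EU-West"}),
             frozenset({"healthcare"}), "regulation"),
    FeedItem("APAC fintech breach", frozenset({"APAC"}),
             frozenset({"fintech"}), "breach"),
    FeedItem("Global identity advisory", frozenset(), frozenset(), "guidance"),
]

eu_health = relevant_items("EU-West", "healthcare", feed)
apac_fintech = relevant_items("APAC", "fintech", feed)
```

Each persona sees only the items tagged for its region and industry, plus anything untagged.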

Five kinds of question, five kinds of “yes”

This is the trap the router exists to avoid. “Yes” is not one thing — it’s five. When a CISO says they’re in favour, the panel has to know which “yes” they mean: I’d buy this is a different measurement from I back this direction, from this beats my other priorities, from I predict it’ll happen, from that’s already true of my org. Read the third column across — it’s five genuinely different questions wearing the same word.

Type	The question really asks…	…so “in favour” means
`PITCH`	Would you adopt / buy this?	You’re willing to pursue it
`STRATEGY`	Is this the right approach?	You back the direction
`PRIORITY`	Is X a priority / where to invest?	It beats your other demands
`FORECAST`	Will X happen, by when?	Your best prediction is “yes”
`FACTUAL`	Do you already do X?	It’s true of your org today

A panel that answers all five with one notion of “agree” is confidently wrong before it starts — that single confusion is most of why the naive version sat 47 points off. Type the question first, and each “yes” gets measured as the thing it actually is.

What the answer looks like

You don’t get a vibe and a number. You get a decision, the split that produced it, the objections that block it, and a statistical read on whether the panel was even the right shape for the question. A real example — a “would you let an autonomous AI auto-close tier-1 alerts in production?” pitch, run against a 50-CISO panel:

What CISOs push back on

1. Auto-closing alerts removes the human judgment our regulators expect on incident handling; that’s a board conversation, not a config toggle.
2. The six-week integration collides with an in-flight IdP migration we can’t pause.
3. Pre-Series-A vendor on a control this central: we’d need source escrow and a continuity plan before signing.

Most opposed: Financial Services (industry), 71% con (n=14)

What the panel almost missed

No objection raised the incident-attribution gap: if the AI auto-closes an alert that turns out to be a real breach, who owns the call in the post-mortem? Surfaced by the review pass, not the panel.

Directional read

Lean No · 38% in favour

pro 19 · neutral 9 · con 22

Confidence: Moderate · 95% CI 25–51% (±13pp · N=50)

WARN

Statistically thin sample for this question

The panel composition is workable but coverage is uneven. Verdicts are usable, but the credible interval is wide.

KL divergence 0.18 · 95% credible interval 0.26–0.51 · recommended N 80

One pitch, one panel: the objections lead, the number is explicitly directional, and the statistics say plainly whether this panel could even answer the question.

Notice the hierarchy. The objections lead — sharp, specific, segment-attributed — because for a question like this they’re the most decision-useful thing in the room. The stance and percentage sit just beneath, with the band sized to the confidence behind them. You always know exactly how much weight the number can bear.
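The band in the example can be checked by hand. The article does not say which interval formula Príncipe uses; the sketch below assumes the standard normal-approximation (Wald) interval for a proportion, which reproduces the reported 25–51% (±13pp) for 19 in favour out of 50.

```python
import math


def wald_interval(in_favour: int, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = in_favour / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


p, low, high = wald_interval(19, 50)
half_width_pp = 100 * (high - low) / 2

print(f"{100 * p:.0f}% in favour, 95% CI {100 * low:.0f}-{100 * high:.0f}% "
      f"(±{half_width_pp:.0f}pp)")
```

The half-width shrinks with the square root of N, which is why a thin panel produces a wide band and why the output pairs the number with a recommended larger N.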

Calibration — three legs, no shortcuts

“Calibrated” isn’t a feeling. We pin it to three independent legs, and we fix the numeric tolerance after a baseline run, never by guessing the number we’d like.

Distributional — against public surveys

Answer distributions checked against a corpus of real CISO surveys (Proofpoint Voice of the CISO, Foundry, Cisco Readiness Index, Glilot, and more). Are we in the right neighbourhood?

On-task — against real CISOs, same questions

We put the exact questions to a live panel of real CISOs and compare answer for answer. The harshest test, and the one that produces the uncomfortable numbers.

Historical — against the past

Run the panel on technologies whose outcome we already know — EDR, Zero Trust, MFA — with leakage controls. Would it have called the winners?

What the loop actually moved

On attitudinal, priority, and factual questions, four measured passes — type-aware framing, a persona disposition axis, an AI-autonomy axis, and 2026 grounding — took mean absolute error from 47 to 18 points, a 62% cut, with the error spread evenly across types instead of piled into one. The panel now splits where real CISOs split, instead of collapsing to a unanimous 0% or 100%.

Stage	Mean absolute error (pp)
Naïve single prompt	46.5
+ framing correction	33.1
+ disposition axis	22.9
+ AI axis & grounding	17.7

Mean absolute error (pp) vs known answers, across a 10-question multi-survey set. Lower is better.
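For readers who want the arithmetic spelled out, this sketch shows how a mean absolute error in percentage points is computed and how the reported stage values give the 62% reduction. The stage numbers are the ones reported above; the two-question input to `mean_absolute_error` is made-up illustration data, not part of the benchmark.

```python
def mean_absolute_error(panel_pcts, real_pcts):
    """Average absolute gap, in percentage points, between panel and real answers."""
    if len(panel_pcts) != len(real_pcts) or not panel_pcts:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - r) for p, r in zip(panel_pcts, real_pcts)) / len(panel_pcts)


# Illustration only: two invented questions, panel answer vs known answer.
toy_mae = mean_absolute_error([20, 60], [50, 70])  # gaps of 30 and 10

# Stage values as reported in this article.
stages = {
    "naive single prompt": 46.5,
    "+ framing correction": 33.1,
    "+ disposition axis": 22.9,
    "+ AI axis & grounding": 17.7,
}
first = stages["naive single prompt"]
last = stages["+ AI axis & grounding"]
relative_cut = (first - last) / first  # fraction of the original error removed
```

Going from 46.5 to 17.7 removes 28.8 of the original 46.5 points, which rounds to the 62% cut quoted in the text.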

Calibrated — and kept that way

Calibration isn’t a one-time stamp. The real world moves — new attacks, new tools, new regulation — so we hold the panel to alignment continuously, against several independent sources at once: public CISO surveys, on-task panels of real security leaders, and historical back-tests of technologies whose outcomes are already known. As new data lands, the corrections re-fit. No single dataset gets to define reality, and no calibration is ever “finished.”

When those sources disagree, that’s signal too. A survey can capture what CISOs say they’ll do while the panel reflects what tends to actually get done — and reconciling the two sharpens both. Triangulating against multiple yardsticks, rather than trusting any one, is what keeps the answer anchored to the real thing as it shifts.

And where a question type hasn’t yet earned a tight band, the product simply says so: the answer comes back marked directional, objections first, the number explicitly secondary. That restraint is the whole point — a panel you can trust where it’s confident is one that doesn’t pretend where it isn’t.

Why this is the real thing

Príncipe isn’t a prompt with a logo. It’s an instrument with a documented error model, a reproducible build, and a standing commitment to keep measuring itself against people who can prove it wrong. The personas are engineered to disagree the way a real room does; the pipeline corrects bias where bias enters; the output admits, out loud, what it doesn’t yet know. In a category racing to look certain, the durable edge is being the one you can check.

Finding a hundred real CISOs and getting a straight answer out of them used to take a year of runway. Now it takes an afternoon. Either way, the answer was never the conviction — it’s the shift you can measure.