Post 3 made the case for deterministic simulation.
That gets you past static review, but on its own it is not a full operating model.
Lots of teams can answer a one-off question like this:
- if this user signs in to this app from this device, what happens?
Far fewer can answer the questions that show up once the estate gets real:
- what changed across our baseline after this edit?
- which scenarios moved from allow to MFA, or from MFA to block?
- did this group change widen scope without the CA team noticing?
- which low-volume admin or recovery paths are now at risk of breaking?
- how do we compare posture across twenty tenants without pretending they should all have identical policy objects?
That is the difference between using simulation as a troubleshooting tool and using it as an engineering discipline.
One-off simulation tells you what happens in a single case. Regression testing tells you whether the estate still behaves the way you intended after change.
That distinction becomes more important as the number of policies, dependencies, and tenants goes up.
Testing one policy is not the same as operating a policy estate
A small tenant with a handful of CA policies can get by with careful review, a bit of manual testing, and report-only observation.
That breaks down once you have:
- baseline policies plus workload-specific overlays
- device and compliance requirements with real exceptions
- multiple client paths for the same business service
- directory drift from group, role, and lifecycle changes
- app onboarding happening outside the identity team
- several engineers making changes over time
- more than one tenant to support
At that point, the problem is no longer "did we check this policy?"
It becomes: do we have a stable definition of expected access behaviour, and can we detect when the estate moves away from it?
That is a regression-testing problem.
A lot of teams keep their baseline at the wrong layer. They baseline policy objects, exports, screenshots, or dashboard views. Those are useful references, but they are not the unit you actually care about.
The unit you care about is the decision outcome for a defined sign-in scenario.
The baseline should be the scenario outcome, not the policy object
Exact policy identity is a poor baseline for Conditional Access.
Two tenants can implement the same control intent with different:
- group names and object IDs
- cloud app assignments
- named locations
- device filters
- policy splits and naming
- exclusion handling
- licensing constraints
Even inside one tenant, a sensible refactor can change policy structure without changing the effective access result.
One tenant might express its workforce baseline with one broad policy and two overlays. Another might split the same intent across six narrower policies. If you baseline exact object identity, those tenants look incomparable. If you baseline effective outcomes, they may be functionally aligned.
That is the useful shift.
The baseline is not:
- "policy 7c9f still exists"
- "the JSON export is byte-for-byte similar"
- "the dashboard still looks green"
The baseline is closer to this:
- a standard workforce user on a compliant device, from a trusted location, signing into SharePoint in a browser with a fresh session, should be allowed without extra challenge
- an admin on an unmanaged device from an untrusted location to an admin surface should be blocked
- a workforce user on an unmanaged device to Exchange Online should hit MFA but not a compliant-device requirement that only applies to SharePoint-backed flows
That is the layer where regression testing becomes useful.
Start with a canonical scenario catalog
If you want repeatable testing, you need a catalog of scenarios that actually describes the estate.
Not a vague checklist. Not "test Teams." Not one happy-path sign-in per app.
A useful scenario catalog defines the access situations you care about in a stable, reusable way.
That usually means each scenario has fields like:
- actor type
- target resource or business action
- client path
- device posture
- location context
- session state
- expected outcome
For example:
- workforce.sharepoint.browser.managed.trusted.fresh
- workforce.exchange.mobile.unmanaged.untrusted.fresh
- admin.azureportal.browser.unmanaged.untrusted.fresh
- guest.teams.browser.managed.untrusted.existing
- registration.device-enrollment.browser.unmanaged.untrusted.fresh
The point is not to invent a cute naming scheme. It is to make the scenarios durable.
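One way to make the IDs durable is to derive them from structured fields rather than typing strings by hand. A minimal sketch in Python, with illustrative field names (nothing here is a product schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """A canonical sign-in scenario; each field is a controlled-vocabulary term."""
    actor: str      # e.g. "workforce", "admin", "guest", "registration"
    resource: str   # target resource or business action, e.g. "sharepoint"
    client: str     # client path, e.g. "browser", "mobile"
    device: str     # device posture, e.g. "managed", "unmanaged"
    location: str   # location context, e.g. "trusted", "untrusted"
    session: str    # session state, e.g. "fresh", "existing"

    @property
    def scenario_id(self) -> str:
        # The dotted ID is derived, so it can never drift from the fields.
        return ".".join([self.actor, self.resource, self.client,
                         self.device, self.location, self.session])

s = Scenario("workforce", "sharepoint", "browser", "managed", "trusted", "fresh")
print(s.scenario_id)  # workforce.sharepoint.browser.managed.trusted.fresh
```

Because the record is frozen, a scenario is also usable as a dictionary key, which matters later when diffing outcomes.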
A good catalog includes:
- steady-state user access
- privileged and admin paths
- onboarding and registration flows
- break-glass and recovery paths
- guest and contractor patterns, where they exist
- low-volume but high-impact journeys that dashboards tend to miss
This is where many teams still under-test. They cover the common browser flows and stop there, then act surprised when mobile, desktop, bootstrap, or admin recovery paths behave differently.
Tenant-specific objects need to map into a common ontology
Once you move past one tenant, raw object identity gets even less useful.
Tenant A may use:

- group All-Staff-CA
- named location HQ NAT
- policy CA-Base-MFA

Tenant B may use:

- group SG-Workforce
- named location Melbourne Office Egress
- policies Global-MFA and SharePoint-Managed-Only
Those are different objects. They may still represent the same concepts.
That is why cross-tenant regression work needs a common ontology.
In plain English: define the canonical categories first, then map each tenant's real objects into them.
Example categories might look like this:
- actor.workforce
- actor.admin.privileged
- actor.guest
- resource.sharepoint-online
- resource.exchange-online
- resource.admin-surface
- client.browser-modern
- client.mobile-modern
- device.managed-compliant
- device.unmanaged
- location.trusted-egress
- location.untrusted
- session.fresh
Then each tenant maps its own groups, roles, apps, locations, and posture signals into those categories.
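In code, the mapping layer can be as small as a per-tenant dictionary from local object names into the shared categories. A sketch using the hypothetical tenant objects above:

```python
# Canonical categories are shared; the mappings are tenant-local.
TENANT_A = {
    "All-Staff-CA": "actor.workforce",    # group -> actor category
    "HQ NAT": "location.trusted-egress",  # named location -> location category
}
TENANT_B = {
    "SG-Workforce": "actor.workforce",
    "Melbourne Office Egress": "location.trusted-egress",
}

def canonicalize(tenant_map: dict[str, str], local_object: str) -> str:
    """Translate a tenant-local object name into the shared ontology."""
    try:
        return tenant_map[local_object]
    except KeyError:
        # An unmapped object is a finding in itself: it stays invisible to
        # cross-tenant comparison until someone classifies it.
        raise ValueError(f"unmapped object: {local_object!r}")

# Different local names, same canonical meaning:
assert canonicalize(TENANT_A, "All-Staff-CA") == canonicalize(TENANT_B, "SG-Workforce")
```

The deliberate failure on unmapped objects is the point: silent fall-through would let new groups or locations drift outside the comparison.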
That is how you compare posture without demanding identical policy construction.
It also helps inside a single tenant. A common ontology forces you to describe what a scenario is before arguing about which policy object happens to implement it today.
Expected outcomes should be versioned artifacts
Once you have stable scenarios, you need stable expected results.
Those expectations should be stored like any other engineering artifact: versioned, reviewable, and diffable.
The important part is to version the expected outcome, not just the input policy export.
A practical expected-outcome record might capture:
- scenario ID
- baseline version
- effective result: allow, challenge, block, restriction
- required grant controls
- relevant session controls
- rationale or policy notes
- tenant-specific approved deviations, if any
A simple, flat record per scenario is enough; the serialization matters less than the fields.
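A sketch of one possible record shape, expressed as plain data (field names are illustrative, not a fixed schema):

```python
# One expected-outcome record per scenario; the list is the versioned artifact.
expected_outcomes = [
    {
        "scenario_id": "workforce.sharepoint.browser.managed.trusted.fresh",
        "baseline_version": "2025.01",
        "expected_result": "allow",  # allow | challenge | block | restriction
        "grant_controls": [],        # e.g. ["mfa"], ["compliant-device"]
        "session_controls": [],
        "rationale": "Workforce baseline: compliant device from trusted egress.",
        "approved_deviations": [],
    },
    {
        "scenario_id": "admin.azureportal.browser.unmanaged.untrusted.fresh",
        "baseline_version": "2025.01",
        "expected_result": "block",
        "grant_controls": [],
        "session_controls": [],
        "rationale": "Admin surfaces require managed, compliant devices.",
        "approved_deviations": [],
    },
]
```

Stored as JSON or YAML in version control, each change to this file is a reviewable, attributable change to the baseline itself.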
This feels boring right up until you need to answer a real question six weeks later.
Then it becomes very useful.
You can see:
- what the expected behaviour was before the change
- when that expectation changed
- who reviewed it
- which deviations were intentional
- whether the runtime result now diverges from the approved baseline
That is the basis for real drift detection.
Regression testing is mostly about diffs
The most useful output from a CA regression system is not a generic "pass."
It is the delta.
You want to know which scenarios changed, how they changed, and whether the change was intended.
That means diffing a candidate state against the current baseline and classifying the result.
Typical deltas worth highlighting include:
- allow -> MFA
- allow -> block
- MFA -> block
- compliant-device requirement added
- auth strength requirement changed
- session restriction added or removed
- previously non-applicable policy now entering scope
That is what blast-radius prediction actually looks like in identity policy work.
Not a generic risk score. Not a red-yellow-green dashboard. A concrete answer to: which scenarios moved?
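Once both sides speak in scenario outcomes, the diff itself is mechanical. A sketch, assuming each side is a map from scenario ID to effective result:

```python
def diff_outcomes(baseline: dict[str, str], candidate: dict[str, str]) -> dict[str, str]:
    """Return only the scenarios whose effective result moved, as 'old -> new'."""
    deltas = {}
    for scenario_id, expected in baseline.items():
        # A scenario missing from the candidate run is itself a delta:
        # something that stopped evaluating is worth a look.
        actual = candidate.get(scenario_id, "not-evaluated")
        if actual != expected:
            deltas[scenario_id] = f"{expected} -> {actual}"
    return deltas

baseline = {
    "workforce.sharepoint.browser.managed.trusted.fresh": "allow",
    "workforce.exchange.mobile.unmanaged.untrusted.fresh": "mfa",
    "admin.azureportal.browser.unmanaged.untrusted.fresh": "block",
}
candidate = {
    "workforce.sharepoint.browser.managed.trusted.fresh": "allow",
    "workforce.exchange.mobile.unmanaged.untrusted.fresh": "block",
    "admin.azureportal.browser.unmanaged.untrusted.fresh": "block",
}
print(diff_outcomes(baseline, candidate))
# {'workforce.exchange.mobile.unmanaged.untrusted.fresh': 'mfa -> block'}
```

The review conversation then starts from that dictionary, not from a policy export.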
This is also why regression runs should not be limited to explicit CA edits.
Some of the most consequential changes come from adjacent systems:
- a group membership change
- a new app or service principal entering scope
- a named-location change after network work
- device-state or compliance logic changing underneath CA
- a troubleshooting exclusion being added and never cleaned up
If those changes can alter the effective sign-in outcome, they belong in the regression loop.
The rollout path should be deliberate
Once you treat CA as a tested system, rollout gets a lot less theatrical.
The workflow is straightforward:
Author -> Simulate -> Diff -> Review blast radius -> Deploy in report-only -> Compare prediction to reality -> Enforce
Each step has a distinct job.
1. Author
Make the policy change, group change, mapping change, or exception change in a candidate state.
2. Simulate
Evaluate the scenario catalog against that candidate state.
This is where deterministic simulation earns its keep. You are not waiting for traffic. You are asking the question directly.
3. Diff
Compare the candidate results to the approved baseline.
This is where you see whether the change is narrow and intended or much broader than the author realised.
4. Review blast radius
If five admin scenarios and one enrollment path moved, that is a different conversation from "Teams got an extra MFA prompt from unmanaged devices." The review should be based on deltas, not guesswork.
5. Deploy in report-only
Report-only is still useful here, just later in the sequence.
It gives you observational feedback from production sign-ins without enforcing the new behaviour. That helps confirm real traffic patterns and catch cases your catalog may not model yet.
But report-only should not be the first serious attempt to understand impact. By that point you should already have a prediction.
6. Compare prediction to reality
This step is easy to skip and worth formalising.
Did production report-only behaviour line up with the predicted scenario diffs?
If not, that tells you something important:
- the model is incomplete
- the mapping is wrong
- a dependency path was missed
- the tenant contains runtime cases the catalog does not yet represent
That is not a failure of testing. It is how the test system improves.
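Formalising step 6 can be as small as a join between predicted outcomes and observed report-only results per scenario. A sketch, with hypothetical data shapes:

```python
def compare_prediction_to_reality(predicted: dict[str, str],
                                  observed: dict[str, str]) -> list[str]:
    """List scenarios where report-only observation disagrees with the prediction.

    predicted: scenario ID -> predicted effective result for the candidate state
    observed:  scenario ID -> result reported by report-only evaluation
    """
    mismatches = []
    for scenario_id, expected in predicted.items():
        if scenario_id not in observed:
            # Not exercised by real traffic yet; the catalog still covers it.
            continue
        if observed[scenario_id] != expected:
            mismatches.append(
                f"{scenario_id}: predicted {expected}, observed {observed[scenario_id]}"
            )
    return mismatches

predicted = {"workforce.exchange.mobile.unmanaged.untrusted.fresh": "block"}
observed = {"workforce.exchange.mobile.unmanaged.untrusted.fresh": "mfa"}
print(compare_prediction_to_reality(predicted, observed))
```

Each mismatch feeds back into the catalog or the mapping, which is how the model earns trust over time.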
7. Enforce
Enforcement should be the last step, after the unknowns are reduced to something you are willing to own.
Cross-tenant baselining does not mean cloning every tenant
This is the point MSPs often get wrong in both directions.
One mistake is to demand object-level sameness across all customers. That usually collapses on contact with reality. Different tenants have different licensing, app portfolios, risk tolerance, onboarding flows, geography, and exception history.
The opposite mistake is to give up on comparability entirely and settle for dashboards plus local folklore.
That is not a serious operating model either.
The workable middle ground is semantic baselining.
In practice, that means:
- keep the scenario catalog canonical
- keep the ontology canonical
- let the tenant mapping be local
- keep deviations explicit and versioned
That lets an MSP say something useful across customers without pretending every tenant should have the same JSON.
For example, you can ask a cross-tenant question like:
- for privileged administrators, what is the expected outcome for unmanaged-device access from untrusted locations to admin surfaces?
Tenant A may answer with a straight block. Tenant B may answer with phishing-resistant auth plus compliant device. Tenant C may carry a documented temporary exception for a migration window.
Those are different implementations, but they are still comparable outcomes.
That is much more honest than flattening everything into one template or waving away the differences as too hard.
A sensible multi-tenant baseline usually has three layers:
- required scenarios every managed tenant should satisfy
- optional scenarios that apply only where the service model needs them
- declared exceptions with owner, reason, and review date
That is enough structure to detect drift without pretending all customers are clones.
Dashboards are useful. They are not regression testing.
A lot of managed identity work gets stuck at the dashboard layer.
Dashboards can tell you things like:
- which policies exist
- which sign-ins hit MFA or block outcomes
- where report-only evaluations are showing up
- which users are seeing failures
- how often a control fired last week
That is operationally useful.
It is also not the same thing as regression testing.
Dashboards are mostly observational.
They depend on real traffic. They are biased toward common paths. They are weak at low-volume or not-yet-observed scenarios. They do not version expected outcomes. They do not tell you how a candidate state compares with the approved baseline before you ship it.
A dashboard can tell you that a sign-in failed yesterday. Regression testing can tell you that a proposed change will alter twelve scenarios, including one enrollment path nobody has exercised this week.
Those are different capabilities.
The same goes for one-off checks in the What If tool. They are useful for examining a case. They are not a regression system until the inputs, expectations, and diffs are structured and repeatable. Microsoft also notes that the What If tool does not test Conditional Access service dependencies, which is exactly the kind of hidden interaction a broader regression model has to account for.
What this category actually is
At this point the category line is pretty clear.
This is not just policy review with nicer visuals. It is not a report-only dashboard with a simulation button glued on. And it is not configuration compliance alone, even though configuration checks still have value.
The stronger category is closer to:
- identity policy simulation
- regression testing
- change impact analysis
- cross-tenant baselining
That framing is more accurate because it describes the real job.
You are not merely checking whether policy objects look sensible. You are maintaining a tested definition of expected access behaviour and looking for drift when the estate changes.
That is the operational jump.
For teams managing more than one tenant, it is also the only model that scales cleanly without turning every customer into a special case or forcing them into a fake template.
Closing
This series started by fixing the mental model. Conditional Access is not a firewall.
Then it moved into why outcomes feel random, why static review runs out of road, and why deterministic simulation belongs between design and rollout.
The last step is to make that repeatable.
The durable model is:
- define canonical scenarios
- map tenant-specific reality into a common ontology
- version the expected outcomes
- diff every meaningful change
- use report-only to compare observation with prediction
- enforce only after the decision surface looks the way you intended
That is what turns Conditional Access from a nervous admin exercise into something closer to engineering.
And for MSP-scale identity work, it is the difference between managing posture and merely watching it.
Series start: Conditional Access Is Not a Firewall: How Entra ID Conditional Access Actually Works
Sources
- Microsoft Learn: The Conditional Access What If tool
- Microsoft Graph: What If evaluation API
- Microsoft Learn: Conditional Access Policy Insights: Monitoring and Evaluation