Why Conditional Access Feels Unpredictable: How Gaps Appear from Unrelated Changes | Matthew Gribben
Entra ID · Conditional Access · Identity · Security
Why Conditional Access Feels Unpredictable: How Gaps Appear from Unrelated Changes
Conditional Access feels random not because the engine is unpredictable, but because real access outcomes depend on apps, dependencies, exclusions, scope drift, device state, network context, and session timing outside the visible policy list.
March 15, 2026 · 13 min read
If you have worked on Conditional Access for any length of time, you have probably had this conversation:
“We didn’t touch CA. Why did this break?”
Or the uglier version:
“We didn’t touch CA. Why did this start allowing people through?”
That feeling is part of why Conditional Access gets described as unpredictable, flaky, or just weird. An admin opens the policy list, sees nothing obviously different, and still ends up with a new prompt, a missed prompt, a broken client flow, or a gap nobody expected.
The problem usually is not that the engine is random. The problem is that people keep treating the policy list as if it is the whole system.
It isn’t.
Conditional Access decisions come from the interaction between policy logic and a moving environment: apps, authentication paths, service dependencies, exclusions, groups, device state, location signals, risk signals, and session timing. If any of those change, the effective result can change even when the policy objects look untouched.
That is why CA often feels unstable in mature tenants. The policy set might be static while the things it depends on keep drifting underneath it.
Seen that way, the policy list is only one input into the runtime result:
[Diagram: the policy list as one input among many into the runtime access decision]
The policy list is not the runtime reality
Post 1 made the case that Conditional Access is not a firewall. It is a decision layer in the sign-in and token path. That matters here because it explains why reviewing policy JSON is only part of the job.
A policy is just a statement of intended logic.
The actual access result depends on questions like these:
which resource was really in play
which client and authentication path was used
which other policies also applied at the same time
whether the user was still in the right groups
whether the device was still compliant
whether the request still came from a trusted location
whether the service flow depended on something else behind the scenes
whether the user hit a fresh evaluation at all
That is a much larger system than “policy A says require MFA.”
Most CA incidents are really incidents in that larger system.
Overlapping policies create hidden AND logic
One of the most common reasons Conditional Access feels unpredictable is that people reason about policies one at a time.
That is almost always a mistake.
If multiple policies apply to the same sign-in, their requirements accumulate. In practice, that means the effective rule is often harsher than any one policy looks in isolation.
A simple way to picture it is as independent policy checks feeding one effective decision:
[Diagram: independent policy checks feeding one effective decision]
A tenant might have:
one policy requiring MFA for all users
another requiring a compliant device for SharePoint Online
another requiring phishing-resistant auth strength for admins
another blocking certain sign-ins from outside trusted locations
Each of those policies may have been created for a sensible reason. Each may look clean when viewed alone. But the runtime experience is the combined result of all applicable policies, not whichever one an admin happened to open first.
That is where the “nobody changed anything” confusion starts. The policies did not change. The set of policies that apply to a given sign-in did.
A user moves into a more privileged group. A new app starts requesting access to SharePoint as part of its flow. A device loses compliance. Suddenly the same human doing what feels like the same task hits a completely different stack of requirements.
From the outside, that looks random.
It is not random. It is hidden AND logic across policies people were not mentally combining.
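That accumulation is easy to sketch. The model below is illustrative only: it is not Microsoft's evaluation engine, and the policy shapes, group names, and control names are all hypothetical. The point it demonstrates is that the effective requirement is the union of every matching policy's controls.

```python
# Illustrative model of how overlapping policies accumulate requirements.
# NOT Microsoft's evaluation engine; policy shapes, group names, and
# control names here are hypothetical.

def applicable_policies(policies, signin):
    """Policies whose user and app conditions match this sign-in."""
    return [
        p for p in policies
        if signin["user_groups"] & p["include_groups"]
        and ("*" in p["apps"] or signin["app"] in p["apps"])
    ]

def effective_requirements(policies, signin):
    """The runtime result is the union of every matching policy's controls."""
    reqs = set()
    for p in applicable_policies(policies, signin):
        reqs |= p["grant_controls"]
    return reqs

policies = [
    {"include_groups": {"all-users"}, "apps": {"*"},
     "grant_controls": {"mfa"}},
    {"include_groups": {"all-users"}, "apps": {"SharePoint Online"},
     "grant_controls": {"compliant-device"}},
    {"include_groups": {"admins"}, "apps": {"*"},
     "grant_controls": {"phishing-resistant-mfa"}},
]
```

A user in `all-users` opening SharePoint Online hits both of the first two policies, so the effective rule is MFA *and* a compliant device. Add that user to `admins` and a third requirement joins the stack, with no policy edit anywhere.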
Exclusions age badly
Most Conditional Access estates become less reliable the moment they start collecting “temporary” exclusions.
You know the pattern:
exclude a pilot group because onboarding is in progress
exclude a VIP because their phone is broken
exclude a break-glass style admin account, then quietly add another
exclude a third-party app while the vendor sorts itself out
exclude a subnet because something in the office is failing
Every one of these sounds local and reasonable at the time. The problem is that exclusions are rarely revisited with the same urgency used to create them.
Over time, exclusions stop reflecting deliberate risk acceptance and start reflecting organisational sediment. They survive staff changes. They outlive projects. They get copied into new policies because “that group is always excluded.” Eventually nobody is sure whether an exception is still necessary or just old.
Exclusions are not passive. They shape the real security boundary.
A tenant can look heavily protected on paper while the actual posture is perforated by old exceptions:
admins excluded from one control but not another
service accounts excluded from broad coverage without tight replacement controls
legacy locations still treated as trusted after network changes
troubleshooting groups that slowly turn into semi-permanent bypass paths
This is one of the cleanest examples of why CA drift is not the same as policy drift. The policy still exists. The wording did not change. But the exclusions inside it have aged into something much weaker than originally intended.
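One practical countermeasure is a standing exclusion inventory. The sketch below walks policy JSON in the shape Microsoft Graph returns from `GET /identity/conditionalAccess/policies` (the `conditions.users.excludeUsers/excludeGroups/excludeRoles` fields); authentication and the HTTP call are omitted, and the sample policy data is made up.

```python
# Sketch of an exclusion inventory over Conditional Access policy JSON.
# The shape mirrors what Microsoft Graph returns from
# GET /identity/conditionalAccess/policies; auth and fetching are omitted,
# and the sample policy below is invented.

def exclusion_report(policies):
    """Flatten every exclusion per policy so exceptions can be re-reviewed."""
    report = []
    for p in policies:
        users = p.get("conditions", {}).get("users", {})
        for kind in ("excludeUsers", "excludeGroups", "excludeRoles"):
            for object_id in users.get(kind, []):
                report.append((p["displayName"], kind, object_id))
    return report

policies = [
    {"displayName": "Require MFA - all users",
     "conditions": {"users": {"excludeUsers": ["break-glass-id"],
                              "excludeGroups": ["pilot-group-id"]}}},
]
```

Run on a schedule, a list like this turns "that group is always excluded" back into a question someone has to answer.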
New app paths change the outcome without a policy edit
A lot of Conditional Access surprises arrive when the resource path changes.
Maybe the business rolls out a new SaaS platform federated to Entra. Maybe a team starts using a different client. Maybe an old workflow gets replaced by a new one using a different enterprise app, service principal, or downstream dependency. From the user’s perspective, they are still just “signing into Microsoft stuff” or “opening the same document.” From CA’s perspective, the request path may be materially different.
This is where app targeting gets people into trouble.
Admins often assume they have covered a business service because they targeted the app name they care about. But the actual sign-in path may involve more hops than the visible app name suggests:
[Diagram: a visible app name resolving into multiple underlying sign-in hops]
That underlying path may involve:
a different cloud app than expected
Microsoft first-party service dependencies
browser and native client differences
modern auth in one case and a different client category in another
token acquisition for an underlying resource the user never consciously thinks about
That is how you get situations like these:
Teams access behaving differently because the flow touches Exchange Online or SharePoint Online
a new mobile client landing outside the assumptions built around browser testing
a line-of-business app exposing SharePoint-backed content in a way that brings SharePoint CA requirements into the experience
an onboarding flow suddenly failing because device registration or app bootstrap hits a path the production test cases never covered
Again, the policy may be unchanged. What changed was the route through the estate.
If your model is “we targeted the app, so we understand the effect,” you are already in trouble.
Service dependencies make blame assignment messy
Conditional Access troubleshooting gets harder when the service the user thinks they are using is not the only service that matters.
This is especially common in Microsoft 365, where the user-facing product is often a bundle of dependencies rather than one cleanly isolated resource. People say they are using Teams, but part of the experience depends on Exchange. They say they are accessing a file in Teams, but SharePoint is involved. They say a device registration flow broke, but the failure shows up in a sign-in journey that touched multiple services.
That creates a predictable failure mode:
the helpdesk blames the visible app
the CA admin checks the obvious policy target
nothing looks wrong there
the real dependency is somewhere adjacent
This is why broad statements like “we tested Teams” or “SharePoint is fine” are much weaker than they sound. Unless you know the dependency path, you may only have tested one face of the flow.
When Conditional Access surprises people, it is often because the dependency graph changed before the policy set did.
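The dependency problem can be made concrete with a transitive expansion. The mapping below is illustrative, not an authoritative list of Microsoft 365 dependencies; the point is that "we tested Teams" actually means testing everything in the expanded set.

```python
# Hypothetical dependency map: the service a user names is often not the
# only service the sign-in path touches. This mapping is illustrative,
# NOT an authoritative list of Microsoft 365 dependencies.

DEPENDS_ON = {
    "Teams": {"Exchange Online", "SharePoint Online"},
    "SharePoint Online": {"OneDrive"},
}

def expand_dependencies(app):
    """Return the app plus everything it transitively depends on."""
    seen, stack = set(), [app]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEPENDS_ON.get(current, ()))
    return seen
```

If a policy or a test plan only names the top-level app, everything else in that set is an assumption, not a verified outcome.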
Group drift quietly changes scope
A Conditional Access policy can stay perfectly static while its scope moves around underneath it.
The simplest reason is group drift.
That drift can happen in several ways:
users added to or removed from a security group
changes in role assignment
nested group design becoming hard to reason about
dynamic group rules catching more or fewer users than expected
identity lifecycle changes that move users between categories
emergency admin access becoming semi-permanent
From a governance point of view this is obvious: policy targets are references, and references point to live objects that change. But in practice, many teams still talk about CA as if scope is frozen unless someone edits the policy itself.
It is not.
A user can become newly subject to strict controls, or silently fall out of them, because of a directory-side change handled by a completely different team. No CA admin has to touch the policy UI for that to happen.
That is a large part of why mature environments develop ugly ownership gaps. One team owns Conditional Access. Another owns group lifecycle. Another owns privileged role assignment. Another owns joiner-mover-leaver processes. Each may be doing sensible local work while the combined posture drifts.
If you have ever looked at a sign-in and thought “why on earth is this person in scope for that?”, you have already run into this.
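The mechanism is simple enough to sketch: evaluate the same unedited policy against two directory snapshots and watch the scope move. Group and user names below are invented.

```python
# Sketch: the same policy, evaluated against two directory snapshots,
# covers different people with no policy edit. Names are invented.

def in_scope(policy, user, membership):
    """membership maps user -> set of group ids at a point in time."""
    groups = membership.get(user, set())
    return bool(groups & policy["include_groups"]) and not (
        groups & policy["exclude_groups"])

policy = {"include_groups": {"finance"}, "exclude_groups": {"ca-pilot"}}

before = {"alice": {"sales"}}
after = {"alice": {"sales", "finance"}}  # directory-side change, not a CA edit
```

The policy object is identical in both evaluations. The directory moved, and the scope moved with it.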
Network and location drift still matters
Named locations and trusted network assumptions have a habit of decaying quietly.
An office moves internet providers. VPN architecture changes. Egress starts coming from a new address range. Remote access tooling shifts. Split tunnelling changes what public IP Microsoft sees. A cloud proxy gets introduced for some traffic but not others. Home broadband and mobile failover produce inconsistent results for the same user over the course of a day.
Then somebody says: “CA started acting weird from the office.”
No, not weird. The location context changed.
Location-based logic can be useful, but it is far less stable than many people pretend. It depends on network architecture, public IP ownership, routing patterns, client behaviour, and operations discipline outside the CA team. If those move, the sign-in classification moves with them.
This is one reason trusted locations need to be treated as living dependencies, not one-time configuration. The danger is not just false blocking. It is false trust.
A location exclusion that made sense when it mapped to a tightly controlled office edge can become much less defensible after network redesign, VPN changes, or broader egress sharing.
By the time someone notices, the policy has not changed at all. The meaning of the location signal has.
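The decay is easiest to see at the classification step: trusted-location logic reduces to "is the egress IP Microsoft sees inside a configured range?" The sketch below uses documentation-only IP ranges (TEST-NET) as stand-ins for a tenant's named locations.

```python
# Sketch of why location logic decays: classification depends entirely on
# the egress IP Microsoft sees. Ranges here are TEST-NET documentation
# ranges standing in for a tenant's named locations.
import ipaddress

TRUSTED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]  # "office" egress

def is_trusted(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TRUSTED_RANGES)
```

If the office's ISP change moves egress to 198.51.100.10, every sign-in from that building silently falls out of the trusted classification, or worse, a range the business no longer controls stays trusted.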
Device, compliance, and risk drift are constant
A lot of CA controls do not rely on static identity properties. They rely on live posture signals.
That includes things like:
whether the device is compliant
whether it is hybrid joined or otherwise in the expected state
whether app protection policy is present
whether user risk or sign-in risk has changed
whether a token is coming from a path that actually carries the needed device claims
This creates a category of failure that is easy to misread. The sign-in feels like the same user doing the same thing, but the state behind the sign-in is no longer the same.
Maybe the device fell out of compliance because the MDM check-in broke. Maybe the machine was reimaged and lost its expected join state. Maybe the user switched from a managed browser to a different client. Maybe Identity Protection raised the effective requirements. Maybe a device claim is available in one path but not another.
From the admin’s point of view, this looks like CA changing its mind.
What actually changed was the input state.
That distinction matters operationally because these failures usually do not get fixed in the CA blade. They get fixed in device management, identity hygiene, client configuration, or risk handling.
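A minimal sketch makes the "CA changed its mind" illusion concrete: hold the policy constant, flip one live signal, and the decision flips with it. The signal and control names are hypothetical.

```python
# Sketch: same user, same policy object, different live posture signal.
# The policy never changes; the input state does. Names are hypothetical.

def decision(policy, signals):
    """Evaluate one posture-dependent control against live signals."""
    if "compliant-device" in policy["grant_controls"] and not signals["device_compliant"]:
        return "blocked"
    return "granted"

policy = {"grant_controls": {"compliant-device"}}
```

The fix for the blocked case lives in MDM check-in, not in the Conditional Access blade, which is exactly why the policy review finds nothing.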
Broad scope is still an outage footgun
A different class of surprise comes from policies that are technically consistent but operationally too broad.
“All users” and “all cloud apps” can be the right answer in some designs. They can also be an efficient way to break onboarding flows, admin recovery paths, unmanaged edge cases, service dependencies, and workloads nobody remembered to test.
The problem is not broad scope by itself. The problem is broad scope combined with shallow understanding of what sits behind that scope.
A policy can look clean and principled while still being one unrelated change away from an outage:
a newly introduced enterprise app falls into an existing broad policy
a registration or bootstrap path turns out to need a weaker initial experience than steady-state access
an emergency admin path gets caught by stronger controls than intended
a dependency you were not thinking about suddenly becomes subject to a requirement it cannot satisfy
This is why broad CA policy is not just a security design decision. It is a systems design decision.
Report-only helps, but it does not remove surprise
Report-only is useful. It lets you observe what a policy would have done without enforcing it, which makes it one of the safer ways to assess potential impact.
But report-only has a hard limit: it only tells you about the sign-ins that actually happen while you are watching.
That coverage gap is easier to see if you separate observed traffic from untested paths:
[Diagram: observed sign-in traffic versus untested paths in the report-only window]
That means it can miss:
rare admin flows
monthly or quarter-end business processes
edge-case client paths
low-volume contractor access
bootstrap and recovery journeys
conditions that only appear from certain locations, devices, or risk states
If those scenarios do not occur during the observation window, report-only cannot validate them.
So yes, report-only is valuable. It is just observational. It is not the same thing as knowing the outcome for a defined scenario.
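One way to keep that limit honest is to maintain a list of scenarios that must be validated and diff it against what the observation window actually produced. The scenario names below are hypothetical.

```python
# Sketch: report-only only validates sign-ins that actually occurred in
# the observation window. Scenario names are hypothetical.

REQUIRED_SCENARIOS = {
    "standard-browser", "admin-recovery", "quarter-end-finance",
    "contractor-mobile", "device-bootstrap",
}

def coverage_gap(observed_scenarios):
    """Scenarios report-only could not validate during the window."""
    return REQUIRED_SCENARIOS - set(observed_scenarios)

observed = ["standard-browser", "contractor-mobile"]
```

Anything left in the gap is a scenario whose outcome is still a guess, however long the report-only window ran.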
That distinction becomes the bridge to the next post.
The real lesson is bigger than the policy list
Once a CA estate grows past the basics, the real access posture is bigger than the set of policy objects.
It includes:
policy definitions
effective scope through groups and roles
exclusions and exception hygiene
app inventory and dependency knowledge
client and auth-path coverage
device and compliance state
network and location assumptions
token and session timing
That is why mature tenants can feel unpredictable even when nobody is recklessly editing policies. The instability is often in the surrounding system, not in the core engine.
This is also why static review runs out of road so quickly. You can review the policy objects carefully and still miss the runtime effect of a change somewhere adjacent.
What comes next
If access outcomes depend on sign-in context, dependency paths, scope drift, posture signals, and timing, then reading the policy list is not enough.
You need a way to ask a more precise question:
For this user, on this device, from this location, via this client and app path, with this policy set, what should happen?
That is where deterministic simulation testing comes in.
Because once Conditional Access stops looking random, the next problem is obvious: most teams still do not have a reliable way to predict its behaviour before production does it for them.
Chief Technology Officer writing about AI systems, software architecture, cyber security, cryptography, and the practical realities of technology leadership.