In Part 2, we built out the dual-core brain of autonomous operations - the Network Knowledge Graph (NKG) as the spatial map of how everything connects, and the Temporal Knowledge Graph (TKG) as the timestamped logbook of everything that changes. We showed how linked identifiers bridge the identity gap across siloed systems, turning abstract ticket IDs into traceable network elements.
Now we put it to work. The most revealing test of any autonomous system is not how it handles a clean, obvious failure - it is how it handles the messy, ambiguous situations that currently take human experts hours to unravel. Change-related incidents are exactly that test.
The Risk Window After a Change
Every network operations team knows this feeling: a maintenance window closes successfully, the on-call engineer stands down, and then - 45 minutes later - something starts going wrong. Not on the element that was changed, but somewhere else. Something adjacent.
This is the risk window after a change, and it is one of the most expensive problems in network operations. The change looks fine in isolation. The degradation looks unrelated at first glance. And by the time a team of experts across domains has manually correlated timestamps across four different systems, significant customer impact has already occurred.
This is the scenario we will walk through in this part.
Midnight: The Change Is Executed
At 12:00:00 AM, during a scheduled maintenance window, a Method of Procedure (MOP) initiates a software upgrade on a core aggregation router - a critical node carrying traffic for multiple services and customer segments. The upgrade itself takes time - configuration changes, service restarts, validation checks. By 12:07:00 AM, the MOP completes. Health checks pass. The on-call team sees no alarms. For the next ten minutes, everything looks stable. From every indication, the maintenance window is a success.
The TKG silently logs the full execution window with precision: (MOP-ID-X, Alters, Aggregation-Router-A, Start Time 12:00:00 AM, End Time 12:07:00 AM).
Background monitoring agents continue their watch.
12:17:00 AM: Something Shifts
Ten minutes after the MOP completes, the KPI Drift Monitor detects subtle but statistically significant dips across several performance indicators - throughput dropping on specific service paths, latency creeping up on a subset of sessions, and a slight uptick in packet retransmissions. None of these individually cross an alarm threshold. Together, they tell a different story.
To a human observer reviewing separate dashboards, these look like minor, unrelated fluctuations across different services. The ten-minute gap between the MOP completion and the KPI dips makes the connection even less obvious. The temptation is to log them as noise and investigate in the morning.
The TKG records each drift point with sub-second precision: (Aggregation-Router-A, Experiences, KPI Degradation, Time 12:17:00 AM).
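The two TKG records above can be sketched as timestamped facts. The class and field names below are illustrative, not a fixed schema - the point is that a change is stored as an interval and an anomaly as a point in time, which makes their temporal relationship directly computable:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class TKGEvent:
    """A timestamped fact in the Temporal Knowledge Graph."""
    subject: str              # e.g. a MOP ticket or network element ID
    predicate: str            # relationship type ("Alters", "Experiences", ...)
    obj: str                  # target element or observed condition
    start: datetime
    end: Optional[datetime] = None  # None for point-in-time events

# The two records from the scenario (date is arbitrary for illustration)
mop_event = TKGEvent("MOP-ID-X", "Alters", "Aggregation-Router-A",
                     datetime(2024, 1, 1, 0, 0, 0),
                     datetime(2024, 1, 1, 0, 7, 0))
drift_event = TKGEvent("Aggregation-Router-A", "Experiences", "KPI Degradation",
                       datetime(2024, 1, 1, 0, 17, 0))

gap = drift_event.start - mop_event.end  # ten minutes between MOP end and drift
```

Storing both as structured facts means the ten-minute gap that obscured the connection for a human becomes a trivial subtraction for the system.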
The Three-Step Reasoning Chain
With the NKG and TKG working in tandem, the autonomous system performs a three-step reasoning process that would take a human team hours to complete manually.
Step 1: Pinpointing the Moment of Drift
The KPI Drift Monitor runs Change Point Detection (CPD) - a statistical technique that identifies when a metric deviates from its historical baseline. It flags 12:17:00 AM as the change point across multiple KPIs and immediately cross-references this against the TKG: a MOP was executed on Aggregation-Router-A between 12:00:00 AM and 12:07:00 AM, completing ten minutes earlier. The temporal proximity alone is not proof - but the co-occurrence of multiple KPI dips, all mapping to the same network segment in the NKG, is enough to trigger deeper investigation.
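A minimal sketch of the idea behind CPD, using a rolling-baseline z-score as a simplified stand-in for production techniques such as CUSUM or PELT (the sample values and window size are invented for illustration):

```python
from statistics import mean, stdev

def detect_change_point(series, window=10, z_threshold=3.0):
    """Return the first index where a sample deviates sharply from the
    rolling baseline of the preceding `window` samples.

    A deliberately simplified stand-in for production CPD algorithms.
    """
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) / sigma > z_threshold:
            return i  # index of the detected change point
    return None

# Throughput samples (Mbps), one per minute: stable, then a dip at minute 17
throughput = [940, 942, 939, 941, 940, 943, 938, 941, 940, 942,
              939, 941, 940, 942, 941, 939, 940, 870]
cp = detect_change_point(throughput)  # flags index 17, the dip
```

The key property is that none of the earlier fluctuations trip the detector - only the statistically significant deviation does, which is exactly why sub-threshold dips across several KPIs can still be flagged when each is compared against its own baseline.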
Step 2: Resolving the Identity Gap
The Change Management Agent takes the MOP ticket identifier - CI-ID-xxxx - and traverses the NKG to resolve it to the actual network element:
CI-ID-xxxx → Linked_To → Aggregation-Router-A → Identified_As → NE-ID-yyyy
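The traversal itself is simple once the linked identifiers exist. A toy sketch, representing the relevant NKG fragment as labeled edges (the structure is illustrative - a production NKG would live in a graph database):

```python
# Minimal NKG fragment as labeled edges: (source, relation) -> target.
# Node and relation names follow the example in the text.
nkg_edges = {
    ("CI-ID-xxxx", "Linked_To"): "Aggregation-Router-A",
    ("Aggregation-Router-A", "Identified_As"): "NE-ID-yyyy",
}

def resolve(start, relations, edges):
    """Follow a chain of relations from a starting node."""
    node = start
    for rel in relations:
        node = edges[(node, rel)]
    return node

ne_id = resolve("CI-ID-xxxx", ["Linked_To", "Identified_As"], nkg_edges)
```

Two hops turn a ticket ID into a network element ID - the entire "identity gap" problem reduces to whether those edges were captured in the first place.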
What was an abstract ticket ID is now a specific, known aggregation router with a precise position in the service topology - one that sits on the path of multiple services and customer segments.
Step 3: Validating the Functional Path
The RCA Agent asks the critical question: "Are the services experiencing KPI degradation at 12:17:00 AM dependent on the aggregation router that was upgraded at 12:00:00 AM?"
It traverses the NKG and confirms that Aggregation-Router-A sits on the service path of the affected traffic flows. The software upgrade introduced subtle changes in traffic handling behaviour - queue scheduling, buffer allocation, or forwarding table updates - that only manifested under live traffic conditions after the maintenance window closed. No other changes occurred in the same window. No other anomalies are present to confuse the picture.
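The dependency check can be sketched as a path-membership query over the NKG. The service names and hop lists below are hypothetical, invented to illustrate the shape of the question the RCA Agent asks:

```python
# Hypothetical service paths from the NKG: ordered hops per service flow.
service_paths = {
    "Service-1": ["Edge-Router-1", "Aggregation-Router-A", "Core-Router-1"],
    "Service-2": ["Edge-Router-2", "Aggregation-Router-A", "Core-Router-2"],
    "Service-3": ["Edge-Router-3", "Aggregation-Router-B", "Core-Router-2"],
}

def services_dependent_on(element, paths):
    """Return the services whose path traverses the given element."""
    return sorted(s for s, hops in paths.items() if element in hops)

degraded = {"Service-1", "Service-2"}   # services showing KPI dips at 12:17 AM
dependent = set(services_dependent_on("Aggregation-Router-A", service_paths))
causal_link = degraded <= dependent     # every degraded flow traverses the router
```

If every degraded service traverses the changed element, and no other change or anomaly is present in the window, the hypothesis survives - which is what lets the system elevate correlation to causality.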
The system elevates this from correlation to causality - with traceable, verifiable evidence.
From Diagnosis to Action
Once the causal chain is established with high confidence, the system does not stop at diagnosis. The Remediation Agent presents a structured evidence artifact to the on-call engineer:
- What happened: Aggregation-Router-A upgrade executed between 12:00:00 AM and 12:07:00 AM introduced changes in traffic handling behaviour
- What it caused: KPI dips across throughput, latency, and packet retransmissions detected at 12:17:00 AM across dependent service paths
- Who is affected: The specific customer segments and network slices traversing Aggregation-Router-A
- Recommended action: Rollback Aggregation-Router-A to the pre-change configuration
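The evidence artifact above could be carried as a structured record rather than free text, so that each field stays traceable back to TKG and NKG entries. A sketch with illustrative field names (not a fixed schema):

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceArtifact:
    """Structured causal narrative handed to the on-call engineer.

    Field names are illustrative; the point is that each claim is a
    discrete, reviewable item rather than a paragraph in an alert.
    """
    what_happened: str
    what_it_caused: str
    affected: list
    recommended_action: str
    evidence_refs: list = field(default_factory=list)  # TKG/NKG record IDs

artifact = EvidenceArtifact(
    what_happened="Upgrade on Aggregation-Router-A, 12:00:00-12:07:00 AM",
    what_it_caused="KPI dips (throughput, latency, retransmissions) at 12:17:00 AM",
    affected=["customer segments and slices traversing Aggregation-Router-A"],
    recommended_action="Rollback Aggregation-Router-A to pre-change configuration",
)
```

Keeping the artifact structured is also what makes the human-approval step fast: the engineer reviews discrete claims with evidence references, not a wall of log lines.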
The engineer reviews the evidence - not a vague alert, but a traceable causal narrative anchored in the NKG and TKG - and approves the remediation. The rollback executes. All of this within minutes of the initial drift, before customers experience meaningful impact.
This is the shift from reactive triage to proactive assurance in practice.
Conclusion: Why This Matters Beyond the Scenario
The midnight maintenance scenario illustrates something important: the hardest problems in network operations are not the obvious failures. They are the ones where the cause and effect are separated by time, by domain boundaries, and by the fragmented identity of data across systems.
Change-related incidents are disproportionately represented in major outages precisely because they exploit these gaps. The NKG and TKG, working together through a coordinated agent workflow, close those gaps - not by making humans work faster, but by giving them the right evidence at the right time to make confident decisions.
In this part, we have covered:
- The "risk window after a change" as a real and costly operational challenge
- How the TKG captures change and anomaly events with sub-second precision
- The three-step agent reasoning chain: drift detection, identity resolution, and topological validation
- How the system moves from diagnosis to human-approved remediation in minutes
Looking Ahead: Part 4 – The Anatomy of an Autonomous Agent
We have now seen what agents do. In Part 4, we will look under the hood at how they actually work - the dual-core design that combines generative AI for understanding unstructured data with deterministic reasoning for verifiable conclusions. We will also look at the trust layer that ensures agents operate within defined boundaries, and why that governance is not a constraint on autonomy but the very thing that makes autonomy possible.