Microsoft says this week's five-hour worldwide Microsoft 365 outage was caused by a router IP address change that led to packet forwarding issues between all other routers in its wide area network (WAN).
Redmond said at the time that the outage was the result of DNS and WAN network configuration issues caused by a WAN update and that users in all regions served by the affected infrastructure were experiencing issues accessing affected Microsoft 365 services.
The issue affected services in waves, peaking approximately every 30 minutes, as shown on the Microsoft Azure service status page (which was itself affected, intermittently returning “504 Gateway Timeout” errors).
The list of services affected by the outage included Microsoft Teams, Exchange Online, Outlook, SharePoint Online, OneDrive for Business, Power BI, Microsoft 365 Admin Center, Microsoft Graph, Microsoft Intune, Microsoft Defender for Cloud Apps, and Microsoft Defender for Identity.
In total, it took Redmond more than five hours to resolve the issue, from 07:05 UTC, when it began investigating, until 12:43 UTC, when service was restored.
“Between 07:05 UTC and 12:43 UTC on January 25, 2023, customers experienced network connectivity issues, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services, including Microsoft 365 and Power Platform,” Microsoft said in a preliminary post-incident report released today.
“While most regions and services had recovered by 09:00 UTC, intermittent packet loss issues were fully mitigated by 12:43 UTC. This incident also impacted Azure Government cloud services that relied on the Azure public cloud.”
We have confirmed that the impacted services have recovered and remain stable. We are investigating a potential impact on the Exchange Online service. Further updates on the Exchange investigation will be available in your admin center under SI # EX502694.
— Microsoft 365 Status (@MSFT365Status) January 25, 2023
Microsoft also revealed that the issue was triggered when changing the IP address of a WAN router using a command that had not been thoroughly vetted and that behaved differently on different network devices.
“As part of a planned change to update the IP address of a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, causing them all to recalculate their adjacency and forwarding tables,” Microsoft said.
“During this recalculation process, the routers were unable to properly forward packets passing through them.”
Although the network began to recover on its own at 08:10 UTC, the automated systems responsible for maintaining the health of the wide area network (WAN) were paused because of the impact on the network.
These systems included those to identify and eliminate faulty devices as well as traffic engineering systems to optimize data flow on the network.
With these systems paused, some network paths continued to experience increased packet loss from 09:35 UTC until the systems were manually restarted, returning the WAN to optimal operating conditions and completing the recovery process at 12:43 UTC.
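The timestamps Microsoft reported can be turned into durations with a few lines of Python. This is only an illustrative sketch: the phase labels and helper names (`TIMELINE`, `elapsed`, `fmt`) are assumptions made for this example, and only the UTC times themselves come from the report.

```python
from datetime import datetime, timedelta

DAY = "2023-01-25"  # date of the incident, per Microsoft's report


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on the incident day (all times are UTC)."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Phase labels are illustrative; the times are the ones quoted in the report.
TIMELINE = {
    "impact_start": utc("07:05"),
    "auto_recovery_begins": utc("08:10"),
    "most_regions_recovered": utc("09:00"),
    "residual_packet_loss_from": utc("09:35"),
    "fully_mitigated": utc("12:43"),
}


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named points on the timeline."""
    return TIMELINE[end] - TIMELINE[start]


def fmt(delta: timedelta) -> str:
    """Format a timedelta as 'Hh MMm'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    print("Total incident:       ", fmt(elapsed("impact_start", "fully_mitigated")))
    print("Until auto-recovery:  ", fmt(elapsed("impact_start", "auto_recovery_begins")))
    print("Residual packet loss: ", fmt(elapsed("residual_packet_loss_from", "fully_mitigated")))
```

Run as a script, this prints a total of 5h 38m, which matches the "more than five hours" figure, and shows that the residual packet loss phase accounted for more than three of those hours.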
Following this incident, Microsoft says it is now blocking the execution of high-impact commands and will also require all command executions to follow guidelines for safe configuration changes.
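Microsoft has not published how these safeguards are implemented, but the general idea of a pre-execution check can be sketched as follows. Everything in this example is hypothetical: the command patterns, the `ChangeRequest` fields, and the `may_execute` function are invented for illustration and do not represent Microsoft's tooling or any real router command set.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Hypothetical patterns for commands treated as high-impact.
# Microsoft has not disclosed the actual command involved.
HIGH_IMPACT_PATTERNS = [
    re.compile(r"^example-change-router-address\b"),
    re.compile(r"^example-reset-all-adjacencies\b"),
]


@dataclass
class ChangeRequest:
    """A hypothetical record describing a command about to be run on a device."""
    command: str
    vetted_on_device_type: bool  # was the command tested on this device model?


def is_high_impact(command: str) -> bool:
    """Return True if the command matches any blocked pattern."""
    return any(p.search(command.strip()) for p in HIGH_IMPACT_PATTERNS)


def may_execute(req: ChangeRequest) -> Tuple[bool, str]:
    """Decide whether a command may run, returning (allowed, reason)."""
    if is_high_impact(req.command):
        return False, "blocked: high-impact command"
    if not req.vetted_on_device_type:
        return False, "blocked: command not vetted on this device type"
    return True, "allowed"


if __name__ == "__main__":
    risky = ChangeRequest("example-change-router-address 192.0.2.1", True)
    unvetted = ChangeRequest("example-show-status", False)
    routine = ChangeRequest("example-show-status", True)
    for req in (risky, unvetted, routine):
        print(req.command, "->", may_execute(req))
```

The second check reflects the detail that the command behaved differently on different network devices; a real safe-change process would likely involve further steps that this sketch does not attempt to model.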