Understanding a Microsoft Global Service Outage
What it Means & How to Prepare
Maintaining business continuity when an issue arises is a challenge many organizations struggle with. The global pandemic that began in Q1 of 2020 (one that many businesses are still navigating) introduced a new set of problems for both service providers and the businesses that rely on their services. As more users conducted their business operations remotely, the influx of new users accessing cloud services in a short span of time made it difficult for workloads to keep up with demand, causing systems to slow down and even go offline completely for periods of time. In the moments after a Microsoft service outage occurs, there are usually more questions than answers. The typical business response is reactive (minimizing the impact of the outage on user productivity, revenue, etc.) rather than proactive: few organizations develop a strategy beforehand that accounts for service downtime and allows the business to keep operating through it. The goal of this blog is to walk you through a Microsoft global service outage so you can better understand what it means for your business and the steps you can take before the next one occurs.
Type of Service Outage
When it comes to service outages, it is important to first identify the scale of the outage and what that means for your organization. It’s also important to identify whether it was a Microsoft-caused outage in the first place. 90% of performance issues that enterprises face with Microsoft 365 are caused by their own infrastructure and network, and not by Microsoft. This means that a business’s IT team may have either overlooked a problem within their own network or assumed that Microsoft must have been the issue. ‘Microsoft is down’ is an easy answer for an IT team to present to the rest of an organization after receiving multiple emails and helpdesk tickets; if multiple users are making the same outage claims, it’s a safe bet that the issue is large-scale. However, without a thorough examination of the problem, that can’t properly be confirmed and addressed.
There are three different outage types: global, regional, and user-specific. The impact on your organization, and the issues that may persist post-outage, depend on which type of outage occurred. The following is a walkthrough of each type. Let’s create a scenario with three components: an organization we will refer to as ‘Company X’; John, an employee of Company X and a user of the service; and Jane, Company X’s IT Administrator.
John is working remotely and is in a meeting with a colleague on Microsoft Teams when suddenly the camera and voice functionality fails; he tries to use the chat function, to no avail. He then decides to send his colleague an email to apologize for the technical difficulties he’s facing and realizes that his Outlook email service has also been impacted by whatever is going on. John then checks the Microsoft 365 health status Twitter account to see if this may be a larger problem. His worst fear is confirmed: Microsoft is experiencing a global outage across all their services, but especially with Teams, OneDrive & Outlook. They are diligently working on repairing the issue, but an expected time of resolution has yet to be provided. It’s important to note that the outage declaration from Microsoft isn’t made in real time; it usually comes within the hour of the outage occurring.
John decides to send a helpdesk ticket to Jane, the IT Administrator, who has received countless other tickets within the past 15 minutes from other employees about the same issues. Jane checks the Microsoft service health dashboard in her portal to examine the health of the Microsoft 365 services that users are trying to access. The image below provides the steps that Jane will take to properly identify that a global outage has indeed occurred.
When it comes to testing and attempting to remediate the issue, Jane must rely solely on the data provided by Microsoft. She’s provided with great information, but only after the fact; the impact has already been felt by Company X, and productivity and revenue have been lost.
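For teams that want to script the same check, the service health information Jane reads in her portal is also exposed through Microsoft Graph's service announcement API. The sketch below is a minimal illustration, not a definitive implementation: it assumes the response of a service health overview request has already been fetched and parsed into a dictionary, the sample payload is made up for illustration, and the helper name `unhealthy_services` is hypothetical.

```python
# Status value that Microsoft Graph reports for a healthy service.
# Assumption: any other status is treated as worth a closer look.
HEALTHY_STATUS = "serviceOperational"


def unhealthy_services(health_overview: dict) -> list[tuple[str, str]]:
    """Return (service, status) pairs for services not reported as operational.

    `health_overview` is assumed to be the parsed JSON body of a service
    health overview request: a dict with a "value" list whose items carry
    "service" and "status" fields.
    """
    flagged = []
    for entry in health_overview.get("value", []):
        status = entry.get("status", "unknown")
        if status != HEALTHY_STATUS:
            flagged.append((entry.get("service", "unknown"), status))
    return flagged


# Hypothetical sample payload, shaped like a service health overview response.
sample = {
    "value": [
        {"service": "Microsoft Teams", "status": "serviceDegradation"},
        {"service": "Exchange Online", "status": "serviceInterruption"},
        {"service": "SharePoint Online", "status": "serviceOperational"},
    ]
}

for service, status in unhealthy_services(sample):
    print(f"{service}: {status}")
```

Keep in mind that a check like this only reflects what Microsoft has already declared, so it inherits the same delay described above.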
If Jane were using Martello, not only would she have known about this impending outage sooner, but she would also have had real-time visibility into everything happening with the Microsoft 365 services in her tenant. Below are some screenshots of a real Microsoft global outage viewed through a Martello dashboard. The affected Office 365 operations are identified by colour and icon coding to flag the outage; the chart on the left shows all of the services when the outage occurred (and when they were restored), and the chart on the right breaks down the degradation caused by the outage for each individual service.
In this scenario John is experiencing the same service-related issues but this time they aren’t as far-reaching as the global outage. This outage is only affecting organizations in and around Company X’s location.
John sends a helpdesk ticket to Jane once again and the following is how Jane assesses the situation at hand:
If Jane were to use a Martello solution in this scenario, she would be able to validate how well Teams is working across Company X’s many offices and pinpoint the regions that are experiencing issues. In the screenshot below you see Martello’s dashboard examining a Teams issue; the charts and graphs showcase the issue broken down by region, time, and actions within the service that are being affected.
This time around John is experiencing issues with the Teams calling functionality and sends a helpdesk ticket to Jane.
Jane follows her usual protocol and determines that this issue may be specific to John because she hasn’t received any other helpdesk tickets related to service issues or other notifications regarding anything similar from any other Company X employees.
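Jane's reasoning across the three scenarios can be expressed as a rough triage rule over open helpdesk tickets. The following is a sketch under stated assumptions: the `Ticket` fields, the `classify_scope` name, and the rule itself (one reporter suggests a user-specific issue, one region suggests a regional one, several regions suggest something wider) are illustrative and not taken from any Martello or Microsoft tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    user: str     # who reported the problem
    region: str   # office or region the user works from
    service: str  # affected service, e.g. "Teams"


def classify_scope(tickets: list[Ticket], service: str) -> str:
    """Roughly classify the scope of an issue from tickets about one service."""
    relevant = [t for t in tickets if t.service == service]
    if not relevant:
        return "no reports"

    users = {t.user for t in relevant}
    regions = {t.region for t in relevant}

    if len(users) == 1:
        return "user-specific"
    if len(regions) == 1:
        return "regional"
    # Reports from several regions: possibly global, but still unconfirmed.
    return "widespread"


# Hypothetical example: only John has reported a Teams problem.
tickets = [Ticket(user="John", region="Office A", service="Teams")]
print(classify_scope(tickets, "Teams"))
```

A "widespread" result is only a hint that the problem may be global; it still needs to be confirmed against Microsoft's service health information and a check of the organization's own network.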
If Jane were to use Martello’s solution, she would have clear insight into the service availability of each employee on Company X’s network. In the screenshot below you can see how John and the workloads he has engaged are clearly identified and can be drilled into further.
How can I better protect my business?
Ultimately, John, Jane, and Company X would have greatly benefited from having a Martello solution on their side, as it provides true end-to-end Microsoft 365 service delivery performance management capabilities. 24/7 availability and performance monitoring allow for detection of service degradation patterns and immediate incident response, so you are aware of issues before users and their productivity are affected. Martello allows you to take control and troubleshoot incidents proactively, freeing up time to allocate IT support where it is needed. When you plan before a Microsoft service outage occurs, you are able to stay ahead of any potential issues.
Stay Ready with Martello