When You’re Not Sure It’s Not You
Internal IT teams are responsible for ensuring service uptime, and that becomes a challenge with cloud applications like Teams, especially when you don’t know the root cause of an outage.
The reality for most organizations relying on Microsoft Teams and other Office 365 cloud services is that service availability is simply expected; Microsoft has enough redundant infrastructure to meet its 99.9% service level agreement.
But what happens when there’s an issue with Teams that impacts an entire office? You can’t simply assume it’s Microsoft that’s down. Nor can you immediately blame the office network infrastructure.
Earlier this year, a major service outage impacted Teams for the organizations using it. The questions immediately became “Is it us? Microsoft? What’s the problem?” As IT pros, we generally fall into this somewhat binary thinking of “us or them”. As it turned out, it was actually a Verizon Fios outage that took out a fair portion of the Northeastern U.S. for a day.
If your organization experienced this outage, internal IT still needed to understand the scope of the problem and tell those impacted which services were unavailable. What’s needed is visibility into three key aspects of the user experience with Microsoft Teams:
- Internal resources – What endpoint and client are in use? What’s the connectivity method (e.g., WiFi vs. wired)? Where is the user located (e.g., at home or in an office)?
- Routing – How are users routing to Teams from an end-to-end perspective? Is a VPN in use? Are they being forced to route through the corporate network first? Is the corporate networking infrastructure up and running properly? Do users just route directly to the Microsoft cloud? Which ISPs are traversed between the user and Teams?
- Microsoft – Is each of the relevant Microsoft services up and running? (In the case of Teams, consider Teams itself, OneDrive for Business, SharePoint, Azure AD for authentication, etc.)
With such visibility in place, an organization impacted by the Verizon outage, on being notified that an entire office of users can’t access Teams, would first be able to see the status of Microsoft 365 services. (Sure, Microsoft has its own dashboard and Twitter feed for this, but seeing that the services themselves are up and running is the first clue in determining where the problem lies.) Next, seeing that the internal networking infrastructure is fully operational is the second clue. Lastly, identifying the specific set of users experiencing the issue (whether they share a single office location or, more likely given the nature of remote work today, all work within a specific geography) helps zero in on the root cause. In some cases, it could also be helpful to drill down to a specific user’s experience (e.g., the CEO) to determine not just who is affected, but how the outage impacts operations.
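The three clues can be sketched as a small triage routine. This is only a minimal illustration, not a real monitoring API: the `UserReport` fields, the `triage` function, and its inputs are hypothetical names, and it assumes service status, internal network health, and per-user reachability are already being collected by whatever monitoring is in place.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class UserReport:
    """One user's view of Teams (hypothetical structure for illustration)."""
    user: str
    location: str          # e.g. "home" or an office name
    isp: str               # ISP traversed between the user and Teams
    can_reach_teams: bool


def triage(services_up: dict[str, bool],
           internal_network_ok: bool,
           reports: list[UserReport]) -> str:
    """Walk the clues in order: Microsoft, internal network, affected users."""
    # Clue 1: are the relevant Microsoft services themselves up?
    down = sorted(name for name, up in services_up.items() if not up)
    if down:
        return "Microsoft service issue: " + ", ".join(down)

    # Clue 2: is the internal networking infrastructure operational?
    if not internal_network_ok:
        return "Internal network issue"

    # Clue 3: what do the affected users have in common?
    affected = [r for r in reports if not r.can_reach_teams]
    if not affected:
        return "No affected users"

    healthy_isps = {r.isp for r in reports if r.can_reach_teams}
    isp, count = Counter(r.isp for r in affected).most_common(1)[0]
    # Every affected user shares one ISP, and nobody on that ISP is healthy.
    if count == len(affected) and isp not in healthy_isps:
        return f"Likely ISP issue: {isp}"

    return "Undetermined: inspect individual user sessions"
```

The same grouping could be run on location instead of ISP to check whether the affected users share an office or a geography; the ordering of the checks simply mirrors the order of the clues described above.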
This level of visibility empowers internal IT to respond. Even if all that can be done is to inform affected users of a service outage on Microsoft’s part, at least those employees will know to focus on something else for the moment. But in cases where the root cause can be pinpointed, internal teams can spring into action.
Ensuring Teams service quality only works when you have complete visibility into every part of the path that connects a user to Teams. With the three aspects of visibility described above, organizations gain far more insight into the problem, which shortens response times, makes response and remediation steps more accurate, and improves workplace productivity.