A practical playbook for monitoring Microsoft 365 service health, separating Microsoft-side incidents from tenant-side problems, communicating clearly, and restoring business productivity faster.
Do not let a Microsoft 365 issue become a business outage.
MSP Corp helps Canadian organizations monitor Microsoft 365, reduce recurring support tickets, escalate the right issues, and build a safer operating model for email, Teams, SharePoint, OneDrive, identities, and backups.
When Outlook will not send, Teams calls drop, SharePoint feels slow, or users cannot sign in, the first question is simple: is Microsoft 365 down, or is something broken inside our own environment?
The answer is not always obvious. Microsoft 365 is a cloud service, but your real user experience depends on more than Microsoft’s platform. It also depends on identity configuration, licensing, endpoints, network routing, DNS, browser state, security controls, third-party integrations, local internet providers, and how quickly your team recognizes patterns across tickets.
This playbook explains how to use Microsoft 365 service health as part of a complete monitoring and response process. The goal is not just to know whether Microsoft has posted an advisory. The goal is to reduce confusion, protect productivity, communicate with confidence, and make sure every Microsoft 365 incident produces a better operating model.
What Microsoft 365 Service Health Actually Shows
Microsoft 365 Service Health is available in the Microsoft 365 admin center under Health > Service health. Microsoft describes it as the place to check known cloud-service problems across services such as Exchange Online, Microsoft Teams, Office on the web, Microsoft Dynamics 365, and other subscribed services before you spend time troubleshooting or opening a support case.1
The Service health page gives administrators a current view of active incidents and advisories. Microsoft also distinguishes between Microsoft-side items it is working on and issues detected in your environment that may require your organization to act.1, 2
Service Health is most useful when you treat it as one signal in a broader operations workflow. It can answer questions like:
- Is there an active Microsoft incident affecting Exchange Online, Teams, SharePoint, OneDrive, Microsoft Defender, Purview, or another service?
- Is the issue an incident, where a service or major function is unavailable, or an advisory, where impact is limited, intermittent, or has a workaround?1
- Has Microsoft posted a status update, mitigation step, service-restored notice, or post-incident report?1, 6
- Is there a tenant-specific issue your organization must act on?
- Is a reported symptom unrelated to Microsoft and more likely caused by local network, device, browser, DNS, identity, or policy changes?
Why Service Health Alone Is Not Enough
A healthy Microsoft dashboard does not always mean your users are healthy. The reverse is also true: an active Microsoft advisory does not always explain every ticket. A mature Microsoft 365 operations program uses several layers of evidence.
Microsoft signals
Service Health, Message Center, Microsoft 365 Health dashboard, service issue history, Microsoft support updates, and Microsoft Graph service communications.
Tenant signals
Sign-in failures, conditional access changes, license assignment problems, mail-flow changes, Teams policies, SharePoint permissions, and admin audit events.
Network signals
Microsoft 365 network connectivity assessments, ISP routing, DNS, VPN, proxy, firewall inspection, Wi-Fi performance, and remote-user location patterns.8
User signals
Ticket spikes, repeated symptoms, VIP reports, contact-centre issues, failed meetings, delayed email, Teams call quality complaints, and helpdesk trend data.
Google’s Site Reliability Engineering guidance recommends focusing monitoring on user-facing symptoms such as latency, traffic, errors, and saturation when you can only track a small set of metrics.12 For Microsoft 365 operations, the same mindset applies: measure what users experience, not only what the admin portal reports.
The Microsoft 365 Service Health Monitoring Model
For small and mid-sized organizations, the practical model is simple: centralize signals, classify impact, assign ownership, and communicate before rumours fill the gap. The table below gives IT leaders and operations teams a working structure.
| Monitoring layer | What to watch | Why it matters | Owner |
|---|---|---|---|
| Microsoft 365 Service Health | Active incidents, advisories, issue history, service-restored notices, post-incident reports | Confirms whether Microsoft is already investigating a known issue and reduces wasted troubleshooting time.1 | Microsoft 365 admin or managed IT provider |
| Microsoft 365 Health dashboard | Health snapshot, service health for top apps, security best-practice signals, desktop update posture | Gives administrators an executive-level view of the Microsoft 365 environment, not only outages.2 | IT manager or global reader |
| Message Center | Planned changes, retirements, admin-impact posts, data privacy messages, major updates | Planned maintenance is not shown in Service Health, so Message Center is needed for proactive change management.1, 7 | Microsoft 365 operations lead |
| Microsoft Graph service communications | Service issues, health data, Message Center posts, and incident reports via API | Allows service-health data to feed dashboards, ticketing workflows, or managed-service reporting.4, 5, 6 | Automation owner or MSP |
| Network connectivity | Location-level network assessments, remote versus onsite patterns, Microsoft 365 connectivity issues | Helps identify whether poor performance comes from local network design or internet path problems.8 | Network administrator |
| Helpdesk and user experience | Ticket spikes, affected locations, VIP impact, repeated symptoms, failed workflows | Shows real business impact and can reveal incidents before any official advisory appears. | Helpdesk lead |
Key takeaway: Microsoft 365 Service Health tells you what Microsoft knows. Your monitoring model tells you what your business is experiencing.
Roles: Who Owns What During a Microsoft 365 Incident?
When email, collaboration, or identity breaks, the fastest teams do not start by debating ownership. They already know who is coordinating, who is diagnosing, who is communicating, and who can approve business workarounds.
| Role | Responsibilities | Best-fit person |
|---|---|---|
| Incident lead | Owns severity, timeline, next-update cadence, decisions, and escalation. | IT manager, operations leader, or MSP service lead |
| Microsoft 365 admin | Checks Service Health, tenant configuration, Message Center, admin audit trails, and support cases. | Global reader, service support admin, or delegated MSP admin |
| Helpdesk lead | Tracks user reports, confirms scope, identifies patterns, and updates ticket templates. | Internal IT coordinator or service desk manager |
| Communications lead | Publishes employee, executive, and customer-facing updates where appropriate. | Operations, HR, communications, or service owner |
| Security lead | Rules out credential theft, phishing, suspicious sign-ins, malicious inbox rules, or data exposure. | Security owner, vCISO, or managed detection and response team |
| Business approver | Approves temporary workarounds when productivity, compliance, or customer commitments are affected. | COO, department head, or executive sponsor |
Keep permissions tight. Microsoft recommends using the fewest permissions needed and limiting the number of users with administrative privileges. Several Microsoft 365 roles can monitor service health, including Helpdesk Administrator and Service Support Administrator, without giving everyone global administrator access.15
The Response Playbook
Use this workflow whenever there is a Microsoft 365 service-health concern, a spike in related tickets, or a business-critical symptom involving Exchange Online, Teams, SharePoint, OneDrive, Entra ID, Defender, Purview, or the Microsoft 365 admin center.
Prepare before the first ticket arrives
Preparation is the difference between a calm response and a noisy scramble. NIST’s incident handling guidance emphasizes preparation, detection and analysis, containment, eradication and recovery, and post-incident activity as core incident-handling phases.11 For Microsoft 365 operations, preparation should include:
- Named owners for Service Health, Message Center, Microsoft support, helpdesk coordination, and internal communication.
- Saved admin center URLs, escalation contacts, Microsoft support access, and backup communication channels.
- Admin roles configured with least privilege, including service-health visibility for the right team members.
- Email and Teams templates ready for employee, executive, and helpdesk updates.
- A standing process for Microsoft 365 changes using Message Center and your weekly, monthly, and quarterly administration tasks.
Detect and confirm the symptom
Start with the user symptom, not the assumed cause. Document the first report, affected workload, affected location, number of users, device type, network path, error message, and time the issue started.
- Which workload is affected: Outlook, Exchange Online, Teams, SharePoint, OneDrive, Entra ID, Defender, Purview, or admin center?
- Is the issue affecting one user, one department, one office, remote users, or everyone?
- Did the issue start after a known change, policy update, license change, network change, or security control update?
- Are users seeing the same error message?
- Are mobile, desktop, and web clients affected the same way?
- Can a test account reproduce the issue?
- Does Microsoft 365 Service Health show an active incident or advisory?
- Does Message Center show a recent rollout, retirement, or admin-impact change?
- Do sign-in logs or conditional access reports show unusual blocks?
- Is there any sign this could be a security incident rather than a service outage?
Check Microsoft 365 Service Health
Open the Microsoft 365 admin center and go to Health > Service health. Review the overview, active issues, organization-specific issues, and issue history. Microsoft says administrators can also report an issue from the Service Health page when they experience a service problem that is not listed, allowing Microsoft to compare signals across organizations.1
If the admin center itself is inaccessible, Microsoft advises using the public service status page and following Microsoft 365 Status communications for certain events.1 Keep a backup communication path for your team so you are not dependent on Teams or Exchange during a Microsoft 365 interruption.
Classify the incident
Classify the issue before escalating. This reduces panic, helps leadership understand impact, and keeps the helpdesk from applying inconsistent workarounds.
| Severity | Definition | Examples | Communication cadence |
|---|---|---|---|
| Sev 1 | Business-critical service unavailable for most users or a regulated operation. | Email down company-wide, Teams meetings failing across departments, widespread sign-in failure. | Initial update within 15 minutes, then every 30 to 60 minutes. |
| Sev 2 | Major workflow degraded for a group, office, or critical department. | SharePoint unavailable for finance close, OneDrive sync failure for a project team, executive mailbox issue. | Initial update within 30 minutes, then every 60 to 90 minutes. |
| Sev 3 | Localized or intermittent issue with workaround available. | Teams call quality degraded for one office, Outlook desktop issue resolved by web app. | Status update when confirmed, then at major milestones. |
| Sev 4 | Informational, planned change, or minor degradation. | Feature rollout, advisory with limited impact, planned change from Message Center. | Normal change communication. |
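The cadence column can be encoded as data so that ticketing or chat automation computes update deadlines the same way every time. The Python sketch below is illustrative only: the names `CADENCE`, `first_update_due`, and `follow_up_window` are hypothetical, and the minute values are copied from the table above. Sev 3 and Sev 4 are milestone-driven in the table, so the sketch returns `None` for them instead of inventing a timer.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Minutes copied from the severity table:
# (initial update, earliest follow-up, latest follow-up).
# Sev 3 and Sev 4 are milestone-driven, so they carry no timer.
CADENCE = {
    "Sev 1": (15, 30, 60),
    "Sev 2": (30, 60, 90),
    "Sev 3": None,
    "Sev 4": None,
}

def first_update_due(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the deadline for the initial update, or None if no timer applies."""
    cadence = CADENCE[severity]
    if cadence is None:
        return None
    initial_minutes, _, _ = cadence
    return detected_at + timedelta(minutes=initial_minutes)

def follow_up_window(severity: str, last_update: datetime) -> Optional[Tuple[datetime, datetime]]:
    """Return the (earliest, latest) time for the next follow-up, or None."""
    cadence = CADENCE[severity]
    if cadence is None:
        return None
    _, earliest, latest = cadence
    return (last_update + timedelta(minutes=earliest),
            last_update + timedelta(minutes=latest))
```

Keeping the numbers in one table-like structure means a change to the severity model only has to be made in one place.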
Separate Microsoft-side, tenant-side, network-side, and security-side causes
Do not stop at the first plausible explanation. A Microsoft advisory might be real, while your organization also has a conditional access misconfiguration, expired certificate, DNS issue, or compromised account. Your triage should sort the issue into one of four lanes:
- Microsoft-side: Confirmed active incident or advisory for a Microsoft service.
- Tenant-side: Configuration, policy, license, user, mail-flow, retention, endpoint, or admin change inside your tenant.
- Network-side: Office, ISP, VPN, proxy, DNS, firewall inspection, or routing problem affecting Microsoft 365 connectivity.
- Security-side: Identity compromise, phishing, risky sign-ins, malicious inbox rules, data exfiltration, or suspicious OAuth app behaviour.
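Because more than one lane can be true at the same time, a triage helper is probably better off returning every lane the evidence supports rather than a single label. The Python sketch below assumes a hypothetical `Evidence` record; its field names are illustrative and do not correspond to any Microsoft API.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """Hypothetical triage inputs; field names are illustrative only."""
    microsoft_advisory_active: bool = False    # Service Health lists a matching incident or advisory
    recent_tenant_change: bool = False         # policy, license, mail-flow, or admin change
    single_site_or_path: bool = False          # only one office, VPN, or ISP path is affected
    suspicious_identity_signals: bool = False  # risky sign-ins, new inbox rules, odd OAuth apps

def candidate_lanes(evidence: Evidence) -> set:
    """Return every lane the evidence supports; more than one may apply."""
    lanes = set()
    if evidence.microsoft_advisory_active:
        lanes.add("Microsoft-side")
    if evidence.recent_tenant_change:
        lanes.add("tenant-side")
    if evidence.single_site_or_path:
        lanes.add("network-side")
    if evidence.suspicious_identity_signals:
        lanes.add("security-side")
    return lanes
```

An empty result means the evidence collected so far does not point anywhere yet, which is a prompt to keep gathering rather than to guess.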
When the symptoms involve sign-ins, unusual prompts, mailbox rules, or unexpected access blocks, treat the event as a potential security incident until ruled out. The related playbook for an alert-to-action triage workflow can help teams move faster without skipping evidence.
Communicate early, briefly, and consistently
Atlassian’s incident communication guidance stresses acknowledging the issue quickly, summarizing known impact, promising further updates, and communicating consistently across channels.13, 14 For Microsoft 365 issues, the first message should not over-explain. It should calm the business and stop duplicate tickets.
Subject: Microsoft 365 service issue under investigation

We are investigating reports of [symptom] affecting [users, department, location, or workload].

Current impact: [plain-language impact]
Known workaround: [workaround or "none confirmed yet"]
What we are checking: Microsoft 365 Service Health, tenant configuration, network path, and recent changes.
Next update: [time]

Please avoid submitting duplicate tickets unless you have a different symptom or business-critical impact.
Escalate with useful evidence
If Microsoft has posted an incident, track the issue ID and updates. If Microsoft has not posted an incident but you have repeatable evidence, open a support request with clear impact, timestamps, affected services, affected users, diagnostic steps, screenshots, and correlation IDs where available.
If you use automation, Microsoft Graph service communications APIs can retrieve service health issues and Message Center communications for a tenant, including current and historical health data and service messages, with the required admin-granted permissions.4, 5
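As a rough illustration of that automation, the Python sketch below reads the Microsoft Graph `admin/serviceAnnouncement/issues` endpoint and keeps only unresolved issues for the services you care about. It rests on several assumptions: the access token is obtained elsewhere (for example through an app registration granted the `ServiceHealth.Read.All` permission), the function names `fetch_issues` and `open_issues` are hypothetical, and filtering is done client-side to avoid depending on specific query options. Confirm field names such as `isResolved`, `service`, and `classification` against the serviceHealthIssue reference before relying on them.

```python
import json
import urllib.request

GRAPH_ISSUES_URL = "https://graph.microsoft.com/v1.0/admin/serviceAnnouncement/issues"

def fetch_issues(access_token: str) -> list:
    """Fetch service health issues, following @odata.nextLink paging.

    Token acquisition is out of scope here and is assumed to happen elsewhere.
    """
    issues, url = [], GRAPH_ISSUES_URL
    while url:
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {access_token}"}
        )
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        issues.extend(page.get("value", []))
        url = page.get("@odata.nextLink")  # absent on the last page
    return issues

def open_issues(issues: list, services: set) -> list:
    """Keep unresolved issues for the given service names, incidents first."""
    selected = [
        issue for issue in issues
        if not issue.get("isResolved", False) and issue.get("service") in services
    ]
    # False sorts before True, so incidents come ahead of advisories.
    return sorted(selected, key=lambda issue: issue.get("classification") != "incident")
```

Separating the network call from the filtering step keeps the filter easy to test with saved sample payloads and easy to reuse in a dashboard or ticketing integration.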
Restore service, then run the post-incident review
When the issue is resolved, send a short resolution note, capture the timeline, update the ticket record, and identify prevention work. Microsoft Service Health can show service-restored status and post-incident reports when published for specific issues.1, 6
For recurring incidents, use a structured RCA process instead of treating every outage as a one-off. Capture what failed, why detection did or did not work, how long communication took, what workaround was used, and which control will prevent recurrence.
Workload-Specific Triage: What to Check First
Microsoft 365 problems feel different depending on the workload. Use the quick checks below to avoid wasting time in the wrong admin centre.
| Workload | User symptom | First checks | Likely lanes |
|---|---|---|---|
| Exchange Online and Outlook | Email delayed, cannot send, mailbox unavailable, Outlook search broken | Service Health, Exchange message trace, mailbox access, transport rules, Outlook web access, mobile versus desktop comparison | Microsoft-side, tenant-side, security-side |
| Microsoft Teams | Calls fail, chat delayed, meetings unavailable, poor call quality | Service Health, Teams admin center, call quality dashboard, network path, VPN and firewall inspection, affected meeting region | Microsoft-side, network-side, tenant-side |
| SharePoint and OneDrive | Files unavailable, sync failures, permissions errors, slow libraries | Service Health, SharePoint admin center, OneDrive sync health, permissions, storage, recent sharing or retention changes | Microsoft-side, tenant-side, network-side |
| Microsoft Entra ID | Sign-in failures, repeated MFA prompts, conditional access blocks, app access denied | Sign-in logs, conditional access reports, identity protection, recent policy changes, service health | Tenant-side, security-side, Microsoft-side |
| Defender and Purview | Portal unavailable, compliance workflow delayed, alerts missing, policies not applying | Service Health, Message Center, licensing, role permissions, policy changes, incident queue | Microsoft-side, tenant-side |
| Admin center | Admins cannot load portal, settings fail, support page inaccessible | Public service status, alternate admin account, different browser/network, admin mobile app, Microsoft support access | Microsoft-side, network-side, tenant-side |
Planned Changes: Use Message Center, Not Service Health
One of the most common Microsoft 365 operations mistakes is treating Service Health as the only health source. Microsoft notes that planned maintenance events are not shown in Service Health. Planned maintenance, feature changes, retirements, data privacy messages, and admin-impact updates belong in Message Center.1, 7
That matters because many “outages” are really change-management failures. A feature rollout changes a workflow. A retirement breaks an integration. A security setting becomes available but nobody assigns an owner. A Teams or Outlook update changes user behaviour, and the helpdesk is surprised.
Build a weekly Message Center review into your Microsoft 365 operating rhythm. Filter for admin impact, user impact, major updates, retirements, and data privacy. Assign owners, due dates, and communication tasks. For AI-related changes, connect Message Center review with AI governance and change-control approvals, especially when Copilot, Graph-connected data, or new app permissions are involved.
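The weekly filter can be approximated in code once Message Center posts have been exported or retrieved, for instance from the Microsoft Graph `admin/serviceAnnouncement/messages` endpoint. The Python sketch below works on plain dictionaries and assumes each post carries a `tags` list and an `isMajorChange` flag. The tag strings in `REVIEW_TAGS` mirror the filter names in the paragraph above and are an assumption; check them against the values your tenant actually returns. The function names are hypothetical.

```python
# Assumed tag strings; verify against real Message Center data for your tenant.
REVIEW_TAGS = {"Admin impact", "User impact", "Major update", "Retirement", "Data privacy"}

def needs_review(message: dict) -> bool:
    """True when a post is a major change or carries one of the review tags."""
    tags = set(message.get("tags") or [])
    return bool(message.get("isMajorChange")) or bool(tags & REVIEW_TAGS)

def weekly_review_queue(messages: list) -> list:
    """Return (id, title, matched tags) for posts that should get an owner."""
    queue = []
    for message in messages:
        if needs_review(message):
            matched = sorted(set(message.get("tags") or []) & REVIEW_TAGS)
            queue.append((message.get("id"), message.get("title"), matched))
    return queue
```

The output is deliberately small: an identifier, a title, and the reason the post was selected, which is enough to create a change-calendar entry and assign an owner.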
Network Performance: The Hidden Microsoft 365 Service Health Variable
If Microsoft 365 is technically healthy but users experience slow Outlook, choppy Teams meetings, or sluggish SharePoint access, look at the network path. Microsoft 365 admin center network connectivity provides tenant-specific network metrics and location-level assessments to help administrators identify network architecture issues and insights.8
Watch for local patterns
If all remote users are fine but one office is slow, you may have a firewall, DNS, ISP, Wi-Fi, VPN, or proxy issue. If one department is affected across locations, you may have a policy, permission, licensing, or app-specific issue. If everyone is affected across networks, Service Health and Microsoft support become higher-priority checks.
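Those three patterns can be written down as a first-pass rule so the helpdesk applies them consistently. The Python sketch below is a simplification under stated assumptions: the function name `first_lane_to_check`, the office and department names in the usage example, and the order of the rules are all hypothetical, and the result is only a hint about where to look first, not a diagnosis.

```python
def first_lane_to_check(affected_offices: set, all_offices: set,
                        remote_affected: bool,
                        affected_departments: set, all_departments: set) -> str:
    """Suggest which lane to examine first from the scope of user reports."""
    # One office is slow while remote users are fine: local network path.
    if len(affected_offices) == 1 and not remote_affected:
        return "network-side"
    # One department affected across locations: policy, permission, license, or app.
    spans_locations = len(affected_offices) > 1 or remote_affected
    if (len(affected_departments) == 1 and len(all_departments) > 1
            and spans_locations):
        return "tenant-side"
    # Everyone affected across networks: Service Health and Microsoft support.
    if affected_offices == all_offices and remote_affected:
        return "Microsoft-side"
    return "undetermined"
```

For example, with hypothetical offices `{"Toronto", "Calgary"}`, a report limited to Toronto with unaffected remote users would suggest the network lane first.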
For hybrid or multi-site organizations, Microsoft 365 monitoring should connect with network standards, firewall reviews, VPN or ZTNA decisions, and office design. If remote access architecture is part of the problem, compare the options in a ZTNA versus VPN migration strategy before making another short-term exception.
Do Not Confuse Service Health With Backup or Business Continuity
Microsoft 365 Service Health helps you understand platform availability. It does not guarantee your organization can recover from accidental deletion, malicious deletion, ransomware encryption, misconfigured retention, or user-driven data loss.
Microsoft’s compliance documentation describes cloud security and compliance as a shared responsibility: Microsoft is responsible for its cloud-service obligations, and customers are responsible for protecting data in a way that satisfies their own compliance requirements.10 Microsoft 365 Backup is designed to protect and restore selected SharePoint, OneDrive, and Exchange data, and Microsoft's overview treats recovery to a healthy state as a core business-continuity concern.9
That means Microsoft 365 service monitoring should connect to backup, retention, and continuity planning. For practical next steps, review what Microsoft covers versus what your organization still needs and make sure your recovery process is included in your business continuity plan.
Communication Templates for Microsoft 365 Incidents
Clear communication prevents duplicate tickets, protects trust, and gives the technical team room to work. Keep templates short and factual. Avoid guessing at root cause before evidence supports it.
Executive update
Subject: Microsoft 365 incident update - [service]

Current status: [Investigating / Microsoft incident confirmed / Workaround available / Resolved]
Business impact: [who is affected and what they cannot do]
Estimated scope: [number of users, department, location, or unknown]
Current action: [what IT/MSP/Microsoft is doing]
Workaround: [web app, mobile app, alternate channel, defer action, none]
Risk notes: [security, compliance, customer impact, data loss, or none identified]
Next update: [time]
Employee update
We are aware of an issue affecting [Microsoft 365 service].

What you may notice: [plain-language symptom]
Who is affected: [scope]
What to do now: [workaround or guidance]
What not to do: Please do not repeatedly retry password resets, change settings, or submit duplicate tickets unless your symptom is different.
Next update: [time]
Resolution update
The Microsoft 365 issue affecting [service/workflow] has been resolved or mitigated.

Impact window: [start time] to [end time]
Resolution: [Microsoft fix, tenant change, network fix, workaround removed]
Next steps: We will review the timeline, confirm any recurring tickets are closed, and document prevention actions.

If you still see the issue, please restart the affected app and submit a ticket with the exact error message.
Common Mistakes to Avoid
Waiting for Microsoft before communicating
You can acknowledge symptoms before root cause is known. Say what is affected, what is being checked, and when the next update will arrive.
Ignoring ticket patterns
Three similar tickets in 10 minutes are a signal. Build escalation rules that convert patterns into incident review quickly.
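One way to turn that rule into an escalation trigger is a sliding-window counter per symptom. The Python sketch below is a minimal illustration, not a ticketing-system integration: the class name `TicketPatternDetector` and the `symptom_key` grouping are hypothetical, the defaults of three tickets and 10 minutes come from the sentence above, and tickets are assumed to arrive in chronological order.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

class TicketPatternDetector:
    """Flag a symptom when enough similar tickets arrive inside a time window."""

    def __init__(self, threshold: int = 3, window: timedelta = timedelta(minutes=10)):
        self.threshold = threshold
        self.window = window
        self._recent = defaultdict(deque)  # symptom_key -> recent ticket times

    def record(self, symptom_key: str, opened_at: datetime) -> bool:
        """Record a ticket; return True when the pattern threshold is reached."""
        times = self._recent[symptom_key]
        times.append(opened_at)
        # Drop tickets that fall outside the window ending at this ticket.
        while times and opened_at - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```

How tickets are grouped into a `symptom_key` (workload plus error text, for instance) is the part that needs the most local judgment, so it is left outside the sketch.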
Assuming it is not security
Sign-in failures, MFA prompts, mailbox anomalies, and suspicious rules need security review, not just availability troubleshooting. If MFA controls are part of the issue, review how to add conditional access the right way.
Skipping recovery planning
Service restoration is not the same as data recovery. Tie Microsoft 365 incident response to backup, retention, and business continuity testing.
When to Bring in a Managed IT Partner
Microsoft 365 operations can become too large for a small internal team to manage alone, especially when the same people are responsible for helpdesk tickets, endpoint issues, cybersecurity, licensing, vendor management, backups, and executive reporting.
A managed IT partner can help when:
- Your team finds out about Microsoft 365 incidents from users before monitoring catches them.
- Ticket volume spikes every time Outlook, Teams, SharePoint, or OneDrive behaves differently.
- Service Health is checked manually, inconsistently, or only after leadership escalates.
- There is no documented Microsoft 365 incident playbook, no severity model, and no communication cadence.
- Identity, backup, security, and Microsoft 365 administration are owned by the same overloaded person.
- Your current provider is reactive, slow to communicate, or unable to explain recurring Microsoft 365 issues. The transition checklist for switching MSPs can help you evaluate whether the problem is service quality, coverage, or operating model.
MSP Corp’s managed IT services help Canadian organizations monitor systems, coordinate Microsoft 365 response, manage helpdesk workflows, improve security, and build clearer processes for recurring incidents. If after-hours coverage is part of the gap, clarify what 24/7 IT support includes and what it does not before you need it during an outage.
Microsoft 365 Service Health Checklist
Use this checklist to convert the playbook into an operating routine.
Daily
- Review Service Health for active incidents and advisories affecting core services.
- Review ticket spikes by workload, location, and user group.
- Check failed sign-ins and identity-related anomalies for business-critical users.
- Confirm backup communication channels are available outside Teams and Exchange.
Weekly
- Review Message Center for admin-impact, user-impact, major update, retirement, and data privacy posts.
- Assign owners to relevant Microsoft 365 changes and update the change calendar.
- Review network connectivity insights for office or remote-user patterns.
- Audit unresolved Microsoft 365 tickets for repeat symptoms.
Monthly
- Review recurring incidents and complete root-cause actions.
- Validate admin roles and least-privilege assignments.
- Test a Microsoft 365 incident communication template.
- Confirm backup and recovery owners, scope, and restore expectations.
Quarterly
- Run a tabletop exercise for Exchange Online, Teams, SharePoint, or identity outage scenarios.
- Review Microsoft 365 backup and retention requirements against business continuity needs.
- Update severity definitions, escalation contacts, and executive reporting templates.
- Review whether recurring Microsoft 365 operational load should be shifted to co-managed or fully managed IT support.
FAQ
How do I check Microsoft 365 service health?
Sign in to the Microsoft 365 admin center with an appropriate admin role, then go to Health > Service health. The dashboard shows current health state, active incidents, advisories, organization-specific issues, and issue history.1
What is the difference between an incident and an advisory?
Microsoft describes an incident as a critical issue where a service or major function is unavailable. An advisory usually means Microsoft is aware of a problem affecting some users while the service remains available: the impact is limited or intermittent, or there is a workaround.1
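Does Service Health show planned maintenance?
No. Microsoft notes that planned maintenance events are not shown in Service Health. Planned maintenance, feature changes, retirements, and admin-impact updates are tracked in Message Center.1, 7
Can Service Health data be pulled into another dashboard?
Yes. Microsoft Graph service communications APIs can retrieve service health issues and Message Center posts for a tenant, given the required admin-granted permissions, so the data can feed dashboards, ticketing workflows, or managed-service reporting.4, 5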
Does a green Service Health dashboard mean Microsoft 365 is fine for my users?
Not always. A green dashboard means Microsoft has not posted a service issue that applies to the services shown in that view. User experience can still be affected by local networks, DNS, VPN, firewalls, endpoints, browser state, identity policies, licenses, third-party tools, or tenant configuration.
Does Microsoft 365 Service Health replace backup?
No. Service Health is about platform health and service incidents. Backup and recovery address data protection, deletion, overwrite, encryption, and continuity scenarios. Microsoft 365 Backup is designed for restore scenarios across selected Microsoft 365 workloads, and Microsoft’s compliance guidance frames cloud compliance and data protection as a shared responsibility.9, 10
Who should own Microsoft 365 service-health response?
Ownership should be shared but clearly assigned. A Microsoft 365 admin or managed IT provider should monitor Service Health and tenant configuration. A helpdesk lead should manage ticket patterns. A communications lead should update employees and leaders. A security lead should rule out compromise when symptoms involve sign-ins, phishing, mailbox anomalies, or access changes.
Final Takeaway
Microsoft 365 Service Health is essential, but it is only the starting point. The stronger approach is a complete operating model: Service Health for Microsoft-side incidents, Message Center for planned change, network insights for connectivity, ticket trends for real user impact, identity and security review for suspicious symptoms, and documented communication templates for calm response.
When that model is in place, Microsoft 365 problems become easier to classify, easier to communicate, and easier to improve after the fact. Instead of asking “is Microsoft down?” your team can answer the better question: what is affected, who owns the response, what should users do now, and how do we prevent this from becoming a recurring issue?
Need a clearer Microsoft 365 operations plan?
MSP Corp can help you assess service-health monitoring, helpdesk escalation, Microsoft 365 administration, security controls, backup readiness, and incident response so your team can spend less time reacting and more time improving.
References
1. Microsoft Learn: How to check Microsoft 365 service health.
2. Microsoft Learn: Microsoft 365 Health dashboard overview.
3. Microsoft Learn: About the Microsoft 365 Admin mobile app.
4. Microsoft Learn: Working with service communications API in Microsoft Graph.
5. Microsoft Learn: List service health issues in Microsoft Graph.
6. Microsoft Learn: serviceHealthIssue resource type.
7. Microsoft Learn: Message center in the Microsoft 365 admin center.
8. Microsoft Learn: Network connectivity in the Microsoft 365 admin center.
9. Microsoft Learn: Overview of Microsoft 365 Backup.
10. Microsoft Learn: Compliance for Microsoft 365 for enterprise and shared responsibility.
11. NIST SP 800-61 Rev. 2: Computer Security Incident Handling Guide.
12. Google Site Reliability Engineering: Monitoring Distributed Systems.
13. Atlassian: Incident communication best practices.
14. Atlassian Statuspage: Incident communication tips.
15. Microsoft Learn: About administrator roles in the Microsoft 365 admin center.