Microsoft 365 Service Health Monitoring Playbook

A practical playbook for monitoring Microsoft 365 service health, separating Microsoft-side incidents from tenant-side problems, communicating clearly, and restoring business productivity faster.

Detect: Use Microsoft 365 Service Health, user signals, network insights, and support tickets together.

Triage: Separate global incidents, tenant misconfigurations, local network issues, and security events.

Communicate: Give leaders, helpdesk, and employees short updates with impact, workaround, and next update time.

Improve: Convert every incident into owner, timeline, root cause, control gaps, and prevention actions.

Do not let a Microsoft 365 issue become a business outage.

MSP Corp helps Canadian organizations monitor Microsoft 365, reduce recurring support tickets, escalate the right issues, and build a safer operating model for email, Teams, SharePoint, OneDrive, identities, and backups.

When Outlook will not send, Teams calls drop, SharePoint feels slow, or users cannot sign in, the first question is simple: is Microsoft 365 down, or is something broken inside our own environment?

The answer is not always obvious. Microsoft 365 is a cloud service, but your real user experience depends on more than Microsoft’s platform. It also depends on identity configuration, licensing, endpoints, network routing, DNS, browser state, security controls, third-party integrations, local internet providers, and how quickly your team recognizes patterns across tickets.

This playbook explains how to use Microsoft 365 service health as part of a complete monitoring and response process. The goal is not just to know whether Microsoft has posted an advisory. The goal is to reduce confusion, protect productivity, communicate with confidence, and make sure every Microsoft 365 incident produces a better operating model.

What Microsoft 365 Service Health Actually Shows

Microsoft 365 Service Health is available in the Microsoft 365 admin center under Health > Service health. Microsoft describes it as the place to check known cloud-service problems across services such as Exchange Online, Microsoft Teams, Office on the web, Microsoft Dynamics 365, and other subscribed services before you spend time troubleshooting or opening a support case.¹

The Service health page gives administrators a current view of active incidents and advisories. Microsoft also distinguishes between Microsoft-side items it is working on and issues detected in your environment that may require your organization to act.^{1, 2}

Plain-English definition: Microsoft 365 Service Health is the official tenant-aware health dashboard for Microsoft cloud services. It helps you confirm whether a known Microsoft incident or advisory may explain a user-impacting issue. It does not replace endpoint monitoring, network monitoring, identity monitoring, backup strategy, or incident response ownership.

Service Health is most useful when you treat it as one signal in a broader operations workflow. It can answer questions like:

Is there an active Microsoft incident affecting Exchange Online, Teams, SharePoint, OneDrive, Microsoft Defender, Purview, or another service?
Is the issue an incident, where a service or major function is unavailable, or an advisory, where the impact is limited or intermittent or a workaround exists?¹
Has Microsoft posted a status update, mitigation step, service-restored notice, or post-incident report?^{1, 6}
Is there a tenant-specific issue your organization must act on?
Is a reported symptom unrelated to Microsoft and more likely caused by local network, device, browser, DNS, identity, or policy changes?

Why Service Health Alone Is Not Enough

A healthy Microsoft dashboard does not always mean your users are healthy. The reverse is also true: an active Microsoft advisory does not always explain every ticket. A mature Microsoft 365 operations program uses several layers of evidence.

Microsoft signals

Service Health, Message Center, Microsoft 365 Health dashboard, service issue history, Microsoft support updates, and Microsoft Graph service communications.

Tenant signals

Sign-in failures, conditional access changes, license assignment problems, mail-flow changes, Teams policies, SharePoint permissions, and admin audit events.

Network signals

Microsoft 365 network connectivity assessments, ISP routing, DNS, VPN, proxy, firewall inspection, Wi-Fi performance, and remote-user location patterns.⁸

User signals

Ticket spikes, repeated symptoms, VIP reports, contact-centre issues, failed meetings, delayed email, Teams call quality complaints, and helpdesk trend data.

Google’s Site Reliability Engineering guidance recommends focusing monitoring on user-facing symptoms such as latency, traffic, errors, and saturation when you can only track a small set of metrics.¹² For Microsoft 365 operations, the same mindset applies: measure what users experience, not only what the admin portal reports.

Image caption: Microsoft 365 service health is most valuable when it is connected to daily operations: ticket trends, identity controls, user communication, escalation paths, and continuous improvement.

The Microsoft 365 Service Health Monitoring Model

For small and mid-sized organizations, the practical model is simple: centralize signals, classify impact, assign ownership, and communicate before rumours fill the gap. The table below gives IT leaders and operations teams a working structure.

Monitoring layer	What to watch	Why it matters	Owner
Microsoft 365 Service Health	Active incidents, advisories, issue history, service-restored notices, post-incident reports	Confirms whether Microsoft is already investigating a known issue and reduces wasted troubleshooting time.¹	Microsoft 365 admin or managed IT provider
Microsoft 365 Health dashboard	Health snapshot, service health for top apps, security best-practice signals, desktop update posture	Gives administrators an executive-level view of the Microsoft 365 environment, not only outages.²	IT manager or global reader
Message Center	Planned changes, retirements, admin-impact posts, data privacy messages, major updates	Planned maintenance is not shown in Service Health, so Message Center is needed for proactive change management.^{1, 7}	Microsoft 365 operations lead
Microsoft Graph service communications	Service issues, health data, Message Center posts, and incident reports via API	Allows service-health data to feed dashboards, ticketing workflows, or managed-service reporting.^{4, 5, 6}	Automation owner or MSP
Network connectivity	Location-level network assessments, remote versus onsite patterns, Microsoft 365 connectivity issues	Helps identify whether poor performance comes from local network design or internet path problems.⁸	Network administrator
Helpdesk and user experience	Ticket spikes, affected locations, VIP impact, repeated symptoms, failed workflows	Shows real business impact and can reveal incidents before any official advisory appears.	Helpdesk lead

Key takeaway: Microsoft 365 Service Health tells you what Microsoft knows. Your monitoring model tells you what your business is experiencing.

Roles: Who Owns What During a Microsoft 365 Incident?

When email, collaboration, or identity breaks, the fastest teams do not start by debating ownership. They already know who is coordinating, who is diagnosing, who is communicating, and who can approve business workarounds.

Role	Responsibilities	Best-fit person
Incident lead	Owns severity, timeline, next-update cadence, decisions, and escalation.	IT manager, operations leader, or MSP service lead
Microsoft 365 admin	Checks Service Health, tenant configuration, Message Center, admin audit trails, and support cases.	Global reader, service support admin, or delegated MSP admin
Helpdesk lead	Tracks user reports, confirms scope, identifies patterns, and updates ticket templates.	Internal IT coordinator or service desk manager
Communications lead	Publishes employee, executive, and customer-facing updates where appropriate.	Operations, HR, communications, or service owner
Security lead	Rules out credential theft, phishing, suspicious sign-ins, malicious inbox rules, or data exposure.	Security owner, vCISO, or managed detection and response team
Business approver	Approves temporary workarounds when productivity, compliance, or customer commitments are affected.	COO, department head, or executive sponsor

Keep permissions tight. Microsoft recommends using the fewest permissions needed and limiting the number of users with administrative privileges. Several Microsoft 365 roles can monitor service health, including Helpdesk Administrator and Service Support Administrator, without giving everyone global administrator access.¹⁵

The Response Playbook

Use this workflow whenever there is a Microsoft 365 service-health concern, a spike in related tickets, or a business-critical symptom involving Exchange Online, Teams, SharePoint, OneDrive, Entra ID, Defender, Purview, or the Microsoft 365 admin center.

Prepare before the first ticket arrives

Preparation is the difference between a calm response and a noisy scramble. NIST’s incident handling guidance names four core phases: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity.¹¹ For Microsoft 365 operations, preparation should include:

Named owners for Service Health, Message Center, Microsoft support, helpdesk coordination, and internal communication.
Saved admin center URLs, escalation contacts, Microsoft support access, and backup communication channels.
Admin roles configured with least privilege, including service-health visibility for the right team members.
Email and Teams templates ready for employee, executive, and helpdesk updates.
A standing process for Microsoft 365 changes using Message Center and your weekly, monthly, and quarterly administration tasks.

Detect and confirm the symptom

Start with the user symptom, not the assumed cause. Document the first report, affected workload, affected location, number of users, device type, network path, error message, and time the issue started.

First 10 questions to answer:

Which workload is affected: Outlook, Exchange Online, Teams, SharePoint, OneDrive, Entra ID, Defender, Purview, or admin center?
Is the issue affecting one user, one department, one office, remote users, or everyone?
Did the issue start after a known change, policy update, license change, network change, or security control update?
Are users seeing the same error message?
Are mobile, desktop, and web clients affected the same way?
Can a test account reproduce the issue?
Does Microsoft 365 Service Health show an active incident or advisory?
Does Message Center show a recent rollout, retirement, or admin-impact change?
Do sign-in logs or conditional access reports show unusual blocks?
Is there any sign this could be a security incident rather than a service outage?
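The details listed in this step can be captured in one structured record so that every first report is logged the same way. The sketch below is one possible shape, not a Microsoft tool: the class name `SymptomReport`, its field names, and the sample values are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SymptomReport:
    """Hypothetical intake record; field names are illustrative."""
    workload: str          # e.g. "Teams", "Exchange Online"
    symptom: str           # what the user sees, in their words
    location: str          # office name or "remote"
    affected_users: int
    device_type: str       # "desktop", "mobile", or "web"
    network_path: str      # e.g. "office LAN", "VPN", "home ISP"
    error_message: str = ""
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def missing_fields(self) -> list:
        """Names of text fields left blank, so the helpdesk can follow up."""
        return [name for name, value in vars(self).items()
                if isinstance(value, str) and not value.strip()]


# Sample data only; the office and symptom are made up for illustration.
report = SymptomReport(
    workload="Teams", symptom="calls drop after 30 seconds", location="Calgary office",
    affected_users=12, device_type="desktop", network_path="office LAN",
)
print(report.missing_fields())
```

A record like this makes gaps visible early: here the exact error message has not been collected yet, which is often the detail Microsoft support asks for first.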

Check Microsoft 365 Service Health

Open the Microsoft 365 admin center and go to Health > Service health. Review the overview, active issues, organization-specific issues, and issue history. Microsoft says administrators can also report an issue from the Service Health page when they experience a service problem that is not listed, allowing Microsoft to compare signals across organizations.¹

If the admin center itself is inaccessible, Microsoft advises using the public service status page and following Microsoft 365 Status communications for certain events.¹ Keep a backup communication path for your team so you are not dependent on Teams or Exchange during a Microsoft 365 interruption.

Classify the incident

Classify the issue before escalating. This reduces panic, helps leadership understand impact, and keeps the helpdesk from applying inconsistent workarounds.

Severity	Definition	Examples	Communication cadence
Sev 1	Business-critical service unavailable for most users or a regulated operation.	Email down company-wide, Teams meetings failing across departments, widespread sign-in failure.	Initial update within 15 minutes, then every 30 to 60 minutes.
Sev 2	Major workflow degraded for a group, office, or critical department.	SharePoint unavailable for finance close, OneDrive sync failure for a project team, executive mailbox issue.	Initial update within 30 minutes, then every 60 to 90 minutes.
Sev 3	Localized or intermittent issue with workaround available.	Teams call quality degraded for one office, Outlook desktop issue resolved by web app.	Status update when confirmed, then at major milestones.
Sev 4	Informational, planned change, or minor degradation.	Feature rollout, advisory with limited impact, planned change from Message Center.	Normal change communication.
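The severity table can be turned into a small decision helper so the helpdesk applies it consistently. This is a minimal sketch under stated assumptions: the cadence text is copied from the table, but the function name, the `scope` values, and the fallback to Sev 2 are illustrative choices rather than part of the playbook.

```python
# Cadence text is copied from the severity table; the input flags are assumptions.
CADENCE = {
    "Sev 1": "Initial update within 15 minutes, then every 30 to 60 minutes.",
    "Sev 2": "Initial update within 30 minutes, then every 60 to 90 minutes.",
    "Sev 3": "Status update when confirmed, then at major milestones.",
    "Sev 4": "Normal change communication.",
}


def classify_severity(scope: str, unavailable: bool, workaround: bool,
                      regulated: bool = False, informational: bool = False) -> str:
    """Map triage facts to a severity label.

    scope is one of "most_users", "group", or "localized" (hypothetical values).
    """
    if informational:
        return "Sev 4"
    if unavailable and (scope == "most_users" or regulated):
        return "Sev 1"
    if scope == "localized" and workaround:
        return "Sev 3"
    # Assumption: anything else is treated as a degraded major workflow.
    return "Sev 2"


severity = classify_severity("most_users", unavailable=True, workaround=False)
print(severity, "->", CADENCE[severity])
```

Defaulting unclear cases to Sev 2 is a deliberate bias toward over-communicating; a team could just as reasonably default to Sev 3 and escalate on evidence.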

Separate Microsoft-side, tenant-side, network-side, and security-side causes

Do not stop at the first plausible explanation. A Microsoft advisory might be real, while your organization also has a conditional access misconfiguration, expired certificate, DNS issue, or compromised account. Your triage should sort the evidence into one or more of four lanes:

Microsoft-side: Confirmed active incident or advisory for a Microsoft service.
Tenant-side: Configuration, policy, license, user, mail-flow, retention, endpoint, or admin change inside your tenant.
Network-side: Office, ISP, VPN, proxy, DNS, firewall inspection, or routing problem affecting Microsoft 365 connectivity.
Security-side: Identity compromise, phishing, risky sign-ins, malicious inbox rules, data exfiltration, or suspicious OAuth app behaviour.
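Because a real advisory can coincide with a tenant, network, or security problem, a triage helper should return every lane that has supporting evidence rather than stopping at the first match. The sketch below illustrates that idea; the evidence flag names are hypothetical and would need to be mapped to the checks your team actually runs.

```python
# Hypothetical evidence flags grouped by lane; adjust to your own checks.
LANE_SIGNALS = {
    "Microsoft-side": ("active_incident_or_advisory",),
    "Tenant-side": ("recent_policy_or_config_change", "license_change"),
    "Network-side": ("single_site_only", "dns_or_proxy_errors"),
    "Security-side": ("risky_sign_ins", "suspicious_inbox_rules", "unknown_oauth_app"),
}


def lanes_to_investigate(evidence: dict) -> list:
    """Return every lane with at least one confirmed signal, in the order listed above."""
    return [lane for lane, signals in LANE_SIGNALS.items()
            if any(evidence.get(signal, False) for signal in signals)]


# A posted advisory and risky sign-ins at the same time: both lanes stay open.
evidence = {"active_incident_or_advisory": True, "risky_sign_ins": True}
print(lanes_to_investigate(evidence))
```

An empty result means no lane has evidence yet, which is itself useful: it tells the incident lead to keep collecting data rather than guess.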

When the symptoms involve sign-ins, unusual prompts, mailbox rules, or unexpected access blocks, treat the event as a potential security incident until ruled out. The related playbook for an alert-to-action triage workflow can help teams move faster without skipping evidence.

Communicate early, briefly, and consistently

Atlassian’s incident communication guidance stresses acknowledging the issue quickly, summarizing known impact, promising further updates, and communicating consistently across channels.^{13, 14} For Microsoft 365 issues, the first message should not over-explain. It should calm the business and stop duplicate tickets.

Subject: Microsoft 365 service issue under investigation

We are investigating reports of [symptom] affecting [users, department, location, or workload].

Current impact: [plain-language impact]
Known workaround: [workaround or "none confirmed yet"]
What we are checking: Microsoft 365 Service Health, tenant configuration, network path, and recent changes.
Next update: [time]

Please avoid submitting duplicate tickets unless you have a different symptom or business-critical impact.
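If the first notice is sent from a script or ticketing tool, the template above can be filled programmatically so the next-update time is always stated. The following is a minimal sketch: the function name, parameter names, and sample values are assumptions, and the template text is copied from the message above.

```python
from datetime import datetime, timedelta

FIRST_NOTICE = """Subject: Microsoft 365 service issue under investigation

We are investigating reports of {symptom} affecting {scope}.

Current impact: {impact}
Known workaround: {workaround}
What we are checking: Microsoft 365 Service Health, tenant configuration, network path, and recent changes.
Next update: {next_update}

Please avoid submitting duplicate tickets unless you have a different symptom or business-critical impact.
"""


def render_first_notice(symptom, scope, impact, sent_at, minutes_to_next_update,
                        workaround="none confirmed yet"):
    """Fill the first-notice template and compute the promised next update time."""
    next_update = sent_at + timedelta(minutes=minutes_to_next_update)
    return FIRST_NOTICE.format(
        symptom=symptom, scope=scope, impact=impact, workaround=workaround,
        next_update=next_update.strftime("%H:%M"),
    )


# Sample values for illustration only.
notice = render_first_notice(
    symptom="delayed email", scope="the finance department",
    impact="Messages may arrive late.", sent_at=datetime(2024, 1, 8, 9, 30),
    minutes_to_next_update=30,
)
print(notice)
```

Computing the next update from the send time keeps the promise concrete, and defaulting the workaround to "none confirmed yet" avoids leaving a blank that readers might misread.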

Escalate with useful evidence

If Microsoft has posted an incident, track the issue ID and updates. If Microsoft has not posted an incident but you have repeatable evidence, open a support request with clear impact, timestamps, affected services, affected users, diagnostic steps, screenshots, and correlation IDs where available.

If you use automation, Microsoft Graph service communications APIs can retrieve service health issues and Message Center communications for a tenant, including current and historical health data and service messages, with the required admin-granted permissions.^{4, 5}
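As a rough illustration of that automation, the sketch below prepares a request to the Microsoft Graph service health issues endpoint and reduces the response to the fields a dashboard or ticketing workflow would typically need. It assumes you already hold an access token for an app that has been granted the `ServiceHealth.Read.All` permission; token acquisition, paging through `@odata.nextLink`, and error handling are left out, and the helper names are hypothetical.

```python
import json
import urllib.request

GRAPH_ISSUES_URL = "https://graph.microsoft.com/v1.0/admin/serviceAnnouncement/issues"


def build_issues_request(access_token: str) -> urllib.request.Request:
    """Prepare (but do not send) the GET request for service health issues."""
    return urllib.request.Request(
        GRAPH_ISSUES_URL,
        headers={"Authorization": f"Bearer {access_token}", "Accept": "application/json"},
    )


def open_issues(payload: dict) -> list:
    """Keep unresolved issues and only the fields a ticketing workflow needs."""
    return [
        {
            "id": issue.get("id"),
            "service": issue.get("service"),
            "classification": issue.get("classification"),
            "status": issue.get("status"),
            "title": issue.get("title"),
        }
        for issue in payload.get("value", [])
        if not issue.get("isResolved", False)
    ]


def fetch_open_issues(access_token: str) -> list:
    """Send the request; needs network access and a valid token (assumed)."""
    with urllib.request.urlopen(build_issues_request(access_token), timeout=30) as resp:
        return open_issues(json.load(resp))


# Made-up payload in the shape of a Graph collection response, for illustration.
sample = {"value": [
    {"id": "EX000001", "service": "Exchange Online", "classification": "incident",
     "status": "investigating", "title": "Sample issue", "isResolved": False},
    {"id": "TM000002", "service": "Microsoft Teams", "classification": "advisory",
     "status": "serviceRestored", "title": "Sample resolved issue", "isResolved": True},
]}
print(open_issues(sample))
```

Filtering on the client side keeps the sketch independent of which OData query options the endpoint accepts; a production integration would also follow paging links and store the issue ID alongside the internal ticket.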

Restore service, then run the post-incident review

When the issue is resolved, send a short resolution note, capture the timeline, update the ticket record, and identify prevention work. Microsoft Service Health can show service-restored status and post-incident reports when published for specific issues.^{1, 6}

For recurring incidents, use a structured RCA process instead of treating every outage as a one-off. Capture what failed, why detection did or did not work, how long communication took, what workaround was used, and which control will prevent recurrence.

Workload-Specific Triage: What to Check First

Microsoft 365 problems feel different depending on the workload. Use the quick checks below to avoid wasting time in the wrong admin centre.

Workload	User symptom	First checks	Likely lanes
Exchange Online and Outlook	Email delayed, cannot send, mailbox unavailable, Outlook search broken	Service Health, Exchange message trace, mailbox access, transport rules, Outlook web access, mobile versus desktop comparison	Microsoft-side, tenant-side, security-side
Microsoft Teams	Calls fail, chat delayed, meetings unavailable, poor call quality	Service Health, Teams admin center, call quality dashboard, network path, VPN and firewall inspection, affected meeting region	Microsoft-side, network-side, tenant-side
SharePoint and OneDrive	Files unavailable, sync failures, permissions errors, slow libraries	Service Health, SharePoint admin center, OneDrive sync health, permissions, storage, recent sharing or retention changes	Microsoft-side, tenant-side, network-side
Microsoft Entra ID	Sign-in failures, repeated MFA prompts, conditional access blocks, app access denied	Sign-in logs, conditional access reports, identity protection, recent policy changes, service health	Tenant-side, security-side, Microsoft-side
Defender and Purview	Portal unavailable, compliance workflow delayed, alerts missing, policies not applying	Service Health, Message Center, licensing, role permissions, policy changes, incident queue	Microsoft-side, tenant-side
Admin center	Admins cannot load portal, settings fail, support page inaccessible	Public service status, alternate admin account, different browser/network, admin mobile app, Microsoft support access	Microsoft-side, network-side, tenant-side (identity)

Planned Changes: Use Message Center, Not Service Health

One of the most common Microsoft 365 operations mistakes is treating Service Health as the only health source. Microsoft notes that planned maintenance events are not shown in Service Health. Planned maintenance, feature changes, retirements, data privacy messages, and admin-impact updates belong in Message Center.^{1, 7}

That matters because many “outages” are really change-management failures. A feature rollout changes a workflow. A retirement breaks an integration. A security setting becomes available but nobody assigns an owner. A Teams or Outlook update changes user behaviour, and the helpdesk is surprised.

Build a weekly Message Center review into your Microsoft 365 operating rhythm. Filter for admin impact, user impact, major updates, retirements, and data privacy. Assign owners, due dates, and communication tasks. For AI-related changes, connect Message Center review with AI governance and change-control approvals, especially when Copilot, Graph-connected data, or new app permissions are involved.

Network Performance: The Hidden Microsoft 365 Service Health Variable

If Microsoft 365 is technically healthy but users experience slow Outlook, choppy Teams meetings, or sluggish SharePoint access, look at the network path. The network connectivity section of the Microsoft 365 admin center provides tenant-specific network metrics, location-level assessments, and insights that help administrators identify network architecture issues.⁸

Watch for local patterns

If all remote users are fine but one office is slow, you may have a firewall, DNS, ISP, Wi-Fi, VPN, or proxy issue. If one department is affected across locations, you may have a policy, permission, licensing, or app-specific issue. If everyone is affected across networks, Service Health and Microsoft support become higher-priority checks.
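These pattern rules can be expressed as a simple heuristic over open tickets. The sketch below is illustrative only: it assumes each ticket carries a location and a department, treats "remote" as just another location value, and uses hypothetical names throughout. An organization with a single site would need a different first rule.

```python
def first_place_to_look(tickets: list, all_locations: set) -> str:
    """tickets: list of (location, department) tuples from the helpdesk queue."""
    if not tickets:
        return "no pattern yet"
    locations = {location for location, _ in tickets}
    departments = {department for _, department in tickets}
    if len(locations) == 1:
        # One site affected while others are fine points at the local path.
        return "network path for that location"
    if len(departments) == 1:
        # One department across several sites points at tenant settings.
        return "policy, permission, licensing, or app-specific settings"
    if locations >= all_locations:
        # Every known location is reporting the problem.
        return "Service Health and Microsoft support"
    return "inconclusive: keep collecting reports"


# Sample site names for illustration only.
sites = {"Toronto", "Vancouver", "remote"}
print(first_place_to_look([("Toronto", "Sales"), ("Toronto", "Finance")], sites))
```

The order of the checks matters: the single-location rule runs first because a local network fault is usually the cheapest hypothesis to confirm or rule out.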

For hybrid or multi-site organizations, Microsoft 365 monitoring should connect with network standards, firewall reviews, VPN or ZTNA decisions, and office design. If remote access architecture is part of the problem, compare the options in a ZTNA versus VPN migration strategy before making another short-term exception.

Do Not Confuse Service Health With Backup or Business Continuity

Microsoft 365 Service Health helps you understand platform availability. It does not guarantee your organization can recover from accidental deletion, malicious deletion, ransomware encryption, misconfigured retention, or user-driven data loss.

Microsoft’s compliance documentation describes cloud security and compliance as a shared responsibility, with Microsoft responsible for its cloud-service obligations and customers responsible for protecting data in a way that satisfies their own compliance requirements.¹⁰ Microsoft 365 Backup is designed to protect and restore selected SharePoint, OneDrive, and Exchange data and emphasizes recovery to a healthy state as a core business-continuity concern.⁹

That means Microsoft 365 service monitoring should connect to backup, retention, and continuity planning. For practical next steps, review what Microsoft covers versus what your organization still needs and make sure your recovery process is included in your business continuity plan.

Communication Templates for Microsoft 365 Incidents

Clear communication prevents duplicate tickets, protects trust, and gives the technical team room to work. Keep templates short and factual. Avoid guessing at root cause before evidence supports it.

Executive update

Subject: Microsoft 365 incident update - [service]

Current status: [Investigating / Microsoft incident confirmed / Workaround available / Resolved]
Business impact: [who is affected and what they cannot do]
Estimated scope: [number of users, department, location, or unknown]
Current action: [what IT/MSP/Microsoft is doing]
Workaround: [web app, mobile app, alternate channel, defer action, none]
Risk notes: [security, compliance, customer impact, data loss, or none identified]
Next update: [time]

Employee update

We are aware of an issue affecting [Microsoft 365 service].

What you may notice: [plain-language symptom]
Who is affected: [scope]
What to do now: [workaround or guidance]
What not to do: Please do not repeatedly retry password resets, change settings, or submit duplicate tickets unless your symptom is different.

Next update: [time]

Resolution update

The Microsoft 365 issue affecting [service/workflow] has been resolved or mitigated.

Impact window: [start time] to [end time]
Resolution: [Microsoft fix, tenant change, network fix, workaround removed]
Next steps: We will review the timeline, confirm any recurring tickets are closed, and document prevention actions.

If you still see the issue, please restart the affected app and submit a ticket with the exact error message.

Common Mistakes to Avoid

Waiting for Microsoft before communicating

You can acknowledge symptoms before root cause is known. Say what is affected, what is being checked, and when the next update will arrive.

Ignoring ticket patterns

Three similar tickets in 10 minutes are a signal. Build escalation rules that convert patterns into incident review quickly.
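An escalation rule like this can be checked automatically with a sliding window over ticket timestamps. The sketch below groups tickets by a symptom key and flags any key with three or more tickets inside ten minutes; the function names, the key format, and the sample tickets are assumptions, and the threshold and window are parameters you would tune.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def has_spike(timestamps, threshold=3, window=timedelta(minutes=10)):
    """True if any `threshold` tickets fall within one `window`."""
    ordered = sorted(timestamps)
    return any(
        ordered[i + threshold - 1] - ordered[i] <= window
        for i in range(len(ordered) - threshold + 1)
    )


def spiking_symptoms(tickets, threshold=3, window=timedelta(minutes=10)):
    """tickets: iterable of (symptom_key, opened_at) pairs; returns keys that spike."""
    by_symptom = defaultdict(list)
    for symptom_key, opened_at in tickets:
        by_symptom[symptom_key].append(opened_at)
    return sorted(key for key, times in by_symptom.items()
                  if has_spike(times, threshold, window))


# Sample tickets for illustration only.
start = datetime(2024, 1, 8, 9, 0)
tickets = [
    ("outlook-cannot-send", start),
    ("outlook-cannot-send", start + timedelta(minutes=4)),
    ("teams-call-drop", start + timedelta(minutes=5)),
    ("outlook-cannot-send", start + timedelta(minutes=9)),
]
print(spiking_symptoms(tickets))
```

How tickets are judged "similar" is the hard part in practice; a shared workload plus a normalized error message is one reasonable starting key.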

Assuming it is not security

Sign-in failures, MFA prompts, mailbox anomalies, and suspicious rules need security review, not just availability troubleshooting. If MFA controls are part of the issue, review how to add conditional access the right way.

Skipping recovery planning

Service restoration is not the same as data recovery. Tie Microsoft 365 incident response to backup, retention, and business continuity testing.

When to Bring in a Managed IT Partner

Microsoft 365 operations can become too large for a small internal team to manage alone, especially when the same people are responsible for helpdesk tickets, endpoint issues, cybersecurity, licensing, vendor management, backups, and executive reporting.

A managed IT partner can help when:

Your team finds out about Microsoft 365 incidents from users before monitoring catches them.
Ticket volume spikes every time Outlook, Teams, SharePoint, or OneDrive behaves differently.
Service Health is checked manually, inconsistently, or only after leadership escalates.
There is no documented Microsoft 365 incident playbook, no severity model, and no communication cadence.
Identity, backup, security, and Microsoft 365 administration are owned by the same overloaded person.
Your current provider is reactive, slow to communicate, or unable to explain recurring Microsoft 365 issues. The transition checklist for switching MSPs can help you evaluate whether the problem is service quality, coverage, or operating model.

MSP Corp’s managed IT services help Canadian organizations monitor systems, coordinate Microsoft 365 response, manage helpdesk workflows, improve security, and build clearer processes for recurring incidents. If after-hours coverage is part of the gap, clarify what 24/7 IT support includes and what it does not before you need it during an outage.

Make Microsoft 365 operations calmer, clearer, and safer.

Get a managed IT discovery call focused on your Microsoft 365 service-health process, helpdesk escalation paths, backup readiness, security controls, and recurring productivity issues.

Microsoft 365 Service Health Checklist

Use this checklist to convert the playbook into an operating routine.

Daily

Review Service Health for active incidents and advisories affecting core services.
Review ticket spikes by workload, location, and user group.
Check failed sign-ins and identity-related anomalies for business-critical users.
Confirm backup communication channels are available outside Teams and Exchange.

Weekly

Review Message Center for admin-impact, user-impact, major update, retirement, and data privacy posts.
Assign owners to relevant Microsoft 365 changes and update the change calendar.
Review network connectivity insights for office or remote-user patterns.
Audit unresolved Microsoft 365 tickets for repeat symptoms.

Monthly

Review recurring incidents and complete root-cause actions.
Validate admin roles and least-privilege assignments.
Test a Microsoft 365 incident communication template.
Confirm backup and recovery owners, scope, and restore expectations.

Quarterly

Run a tabletop exercise for Exchange Online, Teams, SharePoint, or identity outage scenarios.
Review Microsoft 365 backup and retention requirements against business continuity needs.
Update severity definitions, escalation contacts, and executive reporting templates.
Review whether recurring Microsoft 365 operational load should be shifted to co-managed or fully managed IT support.

FAQ

How do I check Microsoft 365 service health?

Sign in to the Microsoft 365 admin center with an appropriate admin role, then go to Health > Service health. The dashboard shows current health state, active incidents, advisories, organization-specific issues, and issue history.¹

What is the difference between an incident and an advisory?

Microsoft describes an incident as a critical issue where a service or major function is unavailable. An advisory usually means Microsoft is aware of a problem affecting some users while the service remains available: the impact is limited or intermittent, or a workaround exists.¹

Does Service Health show planned maintenance?

No. Microsoft notes that planned maintenance events are not shown in Service Health. Use Message Center to track planned maintenance, major updates, retirements, and changes that may require administrator action.^{1, 7}

Can Service Health data be pulled into another dashboard?

Yes. Microsoft Graph service communications APIs provide access to service health and Message Center communications for subscribed Microsoft cloud services, subject to required permissions. The API can retrieve tenant service-health issues and service messages.^{4, 5}

Does a green Service Health dashboard mean Microsoft 365 is fine for my users?

Not always. A green dashboard means only that Microsoft has not listed a known service issue in that view. User experience can still be affected by local networks, DNS, VPN, firewalls, endpoints, browser state, identity policies, licenses, third-party tools, or tenant configuration.

Does Microsoft 365 Service Health replace backup?

No. Service Health is about platform health and service incidents. Backup and recovery address data protection, deletion, overwrite, encryption, and continuity scenarios. Microsoft 365 Backup is designed for restore scenarios across selected Microsoft 365 workloads, and Microsoft’s compliance guidance frames cloud compliance and data protection as a shared responsibility.^{9, 10}

Who should own Microsoft 365 service-health response?

Ownership should be shared but clearly assigned. A Microsoft 365 admin or managed IT provider should monitor Service Health and tenant configuration. A helpdesk lead should manage ticket patterns. A communications lead should update employees and leaders. A security lead should rule out compromise when symptoms involve sign-ins, phishing, mailbox anomalies, or access changes.

Final Takeaway

Microsoft 365 Service Health is essential, but it is only the starting point. The stronger approach is a complete operating model: Service Health for Microsoft-side incidents, Message Center for planned change, network insights for connectivity, ticket trends for real user impact, identity and security review for suspicious symptoms, and documented communication templates for calm response.

When that model is in place, Microsoft 365 problems become easier to classify, easier to communicate, and easier to improve after the fact. Instead of asking “is Microsoft down?” your team can answer the better questions: what is affected, who owns the response, what should users do now, and how do we prevent this from becoming a recurring issue?

Need a clearer Microsoft 365 operations plan?

MSP Corp can help you assess service-health monitoring, helpdesk escalation, Microsoft 365 administration, security controls, backup readiness, and incident response so your team can spend less time reacting and more time improving.
