Vendor Monitoring Guide

Your application runs on code you wrote and services you did not. Every payment processed through Stripe, every message sent through Slack, every file stored on AWS depends on someone else's infrastructure staying online. When it does not, your customers feel it. They do not care that the root cause is three layers deep in a vendor's stack. They care that your product is broken.

Vendor and service status monitoring is the practice of tracking those dependencies so you find out about problems before your customers do. This guide covers everything: how status pages actually work (and where they fall short), how to detect outages yourself, how to respond when a vendor goes down, and how to build systems that survive third-party failures.

Whether you are an engineer debugging a mysterious failure, an ops lead managing incident response, or a founder trying to keep your product reliable, this is the reference you need.

Why Vendor Monitoring Matters

The average business now uses well over 100 SaaS applications [1]. Each one is a dependency. Each one can fail. And when it does, your business absorbs the impact.

This is not a theoretical risk. In June 2021, a latent software bug at Fastly, triggered by a single customer's valid configuration change, took down major portions of the internet for nearly an hour, knocking Amazon, Reddit, the New York Times, and thousands of other sites offline [2]. In December 2021, a multi-hour AWS us-east-1 outage disrupted everything from Disney+ to Slack to Roomba vacuum cleaners [3]. These were not obscure services. They were critical infrastructure that millions of businesses depended on.

The question is not whether your vendors will have outages. They will. The question is whether you will know about it in time to do something useful.

The Dependency Problem

A decade ago, most of your critical systems ran on your own servers. You controlled the hardware, the network, the software stack. If something broke, you could fix it. Today, a typical web application might depend on Stripe for payments, AWS for hosting, Cloudflare for CDN, SendGrid for email, Twilio for SMS, Algolia for search, and Datadog for monitoring. That is seven external dependencies before you even count the tools your team uses internally.

Each dependency multiplies your exposure to outages. If each vendor has 99.9% uptime (roughly 8.8 hours of downtime per year) and you have 10 independent dependencies, the chance that all of them are up at a given moment is 0.999^10, or about 99.0%. That works out to roughly 87 hours per year in which at least one dependency is down, about ten times the exposure of a single vendor. The math works against you as your dependency count grows.
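The compounding effect is easy to check numerically. The sketch below assumes vendor failures are statistically independent, which is optimistic given how many vendors share the same cloud providers. The function name and the vendor counts are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def combined_availability(per_vendor: float, count: int) -> float:
    """Probability that all `count` independent vendors are up at the same moment."""
    return per_vendor ** count


for n in (1, 5, 10, 20):
    avail = combined_availability(0.999, n)
    # Expected hours per year during which at least one vendor is down.
    exposed_hours = (1 - avail) * HOURS_PER_YEAR
    print(f"{n:>2} vendors: {avail:.3%} combined, ~{exposed_hours:.0f} h/year exposed")
```

Real-world exposure can be better or worse than this estimate: correlated failures (several vendors on one cloud region) cluster outages together, and not every vendor outage affects the features you actually use.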

The Cost Is Real

A widely cited Gartner figure puts the average cost of IT downtime at $5,600 per minute [4]. That figure varies enormously by industry and company size, but even for a small SaaS business, vendor outages carry meaningful costs: lost revenue during payment processor outages, lost productivity when collaboration tools go down, and lost customer trust when your product appears unreliable because a dependency failed.

The real cost of vendor downtime extends beyond the obvious. Support tickets spike. Engineering time gets burned on diagnosis. Customers who were about to convert abandon the process. And if outages happen repeatedly, customers start looking at alternatives.

The Problem with Status Pages

The first thing most people do when something seems broken is check the vendor's status page. It sounds reasonable. The vendor should know when their own service is down, right?

Sometimes. But status pages have fundamental problems that make them unreliable as your only source of truth.

Status Pages Are Slow to Update

Vendors do not update their status pages the moment an issue begins. There is always a delay. The vendor's monitoring has to detect the problem. An engineer has to confirm it. Someone has to decide how to characterize it on the status page. In many organizations, the status page update requires approval from a communications or PR team.

During the June 2021 Fastly outage, the company's own timeline shows the first status post going up about 11 minutes after the disruption began, for an issue that was immediately obvious to anyone trying to load a website [2]. Eleven minutes is a long time when your customers are staring at error messages.

Vendors Underreport Problems

Status pages are a public communication channel. Vendors know that investors, prospective customers, and the press read them. This creates an incentive to minimize the apparent severity of incidents.

You will see phrases like "investigating increased error rates" when the service is completely unavailable for a significant portion of users. You will see "degraded performance" when API response times have gone from 200ms to 30 seconds. The language is carefully chosen to sound measured and controlled, even when the reality is messy.

Regional and Partial Outages Get Missed

Many status pages report a single global status. If the service is degraded only in the EU-West region, or only for customers on a particular plan tier, the status page might show green while your users are experiencing failures. This is especially common with cloud infrastructure providers that have dozens of services across multiple regions.

During the AWS us-east-1 outage in December 2021, the AWS status page itself was partially broken because the system that updates it was hosted in the affected region [3]. The status page was literally unable to report the outage accurately because it was caught in the outage.

What Status Pages Are Good For

Despite these limitations, status pages are not useless. They are the canonical source for a vendor's official incident communications. Once an incident is acknowledged, status pages provide structured updates, root cause information, and resolution timelines. They are also the place where vendors post scheduled maintenance windows.

The key is to treat status pages as one input among several, not as your single source of truth. For a deeper look at how to evaluate what you are reading, see how to check if a service is down.

How to Check if a Service Is Down

When something in your stack stops working and you need to figure out whether it is your code or your vendor, you have a few options. Each has tradeoffs.

Check the Vendor's Status Page

The obvious first step. Go to the vendor's status page (usually at status.vendor.com or vendor.statuspage.io) and look for active incidents. If there is one, you have your answer. If the page shows all green, that does not necessarily mean the service is healthy, but it does mean the vendor has not acknowledged a problem yet.

Search Social Media

Twitter (now X) and Reddit are often the fastest sources for outage confirmation. Search for the vendor name plus "down" or "outage" and sort by recent. If hundreds of people are posting about the same problem in the last few minutes, you can be fairly confident the issue is real and widespread.

The downside is noise. People complain about services on social media constantly, so you need to distinguish between a genuine outage spike and normal background chatter.

Use Third-Party Status Checkers

Tools like Is That Down aggregate status information from multiple sources, giving you a faster and more complete picture than any single status page provides. Instead of checking five different status pages and three social media platforms, you get a consolidated view of whether a service is experiencing problems.

This is especially valuable when you depend on multiple vendors. Checking each one individually is tedious and slow. A monitoring tool that watches all of them continuously and alerts you when something changes saves real time during incidents.

Check From Multiple Locations

If you suspect a regional issue, test the service from different geographic locations. Tools like ping, traceroute, and online services like check-host.net can help you determine whether a service is unreachable globally or only from certain networks.
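Once you have results from several vantage points, the global-versus-regional question reduces to a small comparison. A minimal sketch, assuming you already collect one pass/fail result per probe location (the location names are hypothetical):

```python
def classify_reachability(results: dict[str, bool]) -> str:
    """Summarize per-location probe results (True means the service responded)."""
    if not results:
        raise ValueError("no probe results to classify")
    failing = sorted(loc for loc, ok in results.items() if not ok)
    if not failing:
        return "healthy"
    if len(failing) == len(results):
        return "global outage"
    return "regional: " + ", ".join(failing)


probes = {"us-east": True, "eu-west": False, "ap-south": True}
print(classify_reachability(probes))
```

With only two or three probe locations, a single failing probe may reflect that probe's own network rather than the vendor, so treat a lone failure as a hint rather than a verdict.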

Monitor It Yourself

For your most critical dependencies, consider running your own health checks. A simple script that hits the vendor's API every minute and measures response time and error rates gives you ground truth that no status page can match. If your health check shows the API returning 500 errors, you know there is a problem regardless of what the status page says.
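A health check of this kind can be very small. The sketch below uses only the Python standard library. The CheckResult type, the function name, and the injectable opener parameter (there so the check can be exercised without network access) are assumptions for illustration, and the URL you pass should be whatever lightweight endpoint your vendor documents for this purpose.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    ok: bool
    status: Optional[int]  # HTTP status code, or None if no response arrived
    latency_s: float
    error: Optional[str] = None


def check_endpoint(url: str, timeout_s: float = 5.0,
                   opener=urllib.request.urlopen) -> CheckResult:
    """Issue one GET request and record status and latency."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            status = resp.status
        return CheckResult(200 <= status < 400, status, time.monotonic() - started)
    except urllib.error.HTTPError as exc:
        # The vendor answered, but with a 4xx/5xx status.
        return CheckResult(False, exc.code, time.monotonic() - started, str(exc))
    except OSError as exc:
        # Covers URLError, timeouts, DNS failures, and refused connections.
        return CheckResult(False, None, time.monotonic() - started, str(exc))
```

Run it from a scheduler once a minute and store each CheckResult. Keeping the latency alongside the status lets you spot slow responses as well as outright errors.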

Crowdsourced Outage Detection

Crowdsourced platforms like Downdetector have become popular for outage detection. The concept is straightforward: users report when they are having problems, and the platform aggregates those reports to identify outages. When the report volume spikes above a baseline, the platform declares an outage.

How It Works

Downdetector and similar platforms collect reports from users through their website, mobile app, and social media monitoring. They establish a baseline report rate for each service and flag anomalies when reports spike significantly above that baseline. The result is a heat map and graph showing report volume over time.
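The baseline-and-spike idea can be sketched as follows. This is not Downdetector's actual algorithm, which the text above does not describe in detail. It is a simple mean-plus-standard-deviation rule, and the threshold of three standard deviations and the minimum report count are assumptions.

```python
from statistics import mean, stdev


def is_report_spike(history: list[int], current: int,
                    threshold_sigmas: float = 3.0, min_reports: int = 10) -> bool:
    """Flag `current` if it sits far above the baseline formed by `history`.

    `history` holds report counts for comparable past intervals,
    for example the same 15-minute slot on previous days.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    # Floor the spread so a perfectly flat history does not make every blip a spike.
    spread = max(stdev(history), 1.0)
    return current >= min_reports and current > baseline + threshold_sigmas * spread


print(is_report_spike([2, 3, 1, 4, 2, 3], current=40))
print(is_report_spike([2, 3, 1, 4, 2, 3], current=5))
```

The minimum report count matters for small services: without it, going from one report to five would look like a dramatic spike.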

The Strengths

Crowdsourced detection can be genuinely fast. When a major service goes down and millions of users are affected, the spike in reports shows up within minutes. For big, obvious outages of consumer-facing services, crowdsourced platforms often reflect the problem before the vendor's status page does.

The Limitations

Crowdsourced detection has significant blind spots. It works well for consumer services with millions of users (Gmail, Netflix, PlayStation Network) but poorly for B2B services with smaller user bases. If your critical dependency is a niche API with 10,000 customers, Downdetector is not going to help.

False positives are also a problem. A viral tweet complaining about a service can generate a spike in reports even when the service is operating normally. Conversely, intermittent issues that affect a small percentage of users may never generate enough reports to trigger detection.

Crowdsourced platforms also cannot tell you what is wrong. They can tell you that "something seems off with Cloudflare" but not whether it is a DNS issue, a CDN issue, or a specific product within Cloudflare's suite. For operational response, you need more detail.

For a detailed comparison of crowdsourced vs. automated monitoring approaches, see Is That Down vs. Downdetector.

Mapping Your Dependencies

Before you can monitor your vendors effectively, you need to know what they are. This sounds obvious, but most organizations significantly underestimate their dependency count.

Direct Dependencies

Start with the services your application directly integrates with. These are typically listed in your codebase as API integrations, SDKs, or configured services. Payment processing, email delivery, hosting, CDN, database, search, analytics, authentication. Go through your codebase, your infrastructure configuration, and your .env files.

Indirect Dependencies

Your vendors have their own vendors. If your payment processor relies on a specific bank's API, and that bank has an outage, your payments will fail even though your payment processor's status page might initially show green. You generally cannot map these deeply, but you should be aware of the major ones. For example, a huge percentage of the internet relies on AWS, so an AWS outage affects services that are not themselves hosted on AWS but depend on other services that are.

Internal Tool Dependencies

Do not forget the tools your team uses. Slack, GitHub, Jira, Notion, Google Workspace, your CI/CD pipeline. These are not customer-facing dependencies, but they affect your ability to ship code, communicate, and respond to incidents. A GitHub outage during an active incident means you cannot merge a hotfix.

Building a Dependency Map

Create a simple document or spreadsheet that lists every external service your organization depends on. For each one, note:

What it does for you (payments, email, hosting, etc.)
How critical it is (can you operate without it for an hour? A day?)
What happens when it fails (graceful degradation, full outage, data loss?)
Who on your team needs to know when it has problems

This map becomes the foundation for your monitoring setup. It tells you what to watch, how urgently to alert, and who to notify. For guidance on evaluating your vendors, see choosing reliable SaaS vendors.
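If you prefer to keep the map in version control rather than a spreadsheet, the four questions above translate directly into a small data structure. The entries and criticality assignments below are illustrative examples, not recommendations for your stack.

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    CRITICAL = 1        # cannot operate for an hour without it
    IMPORTANT = 2       # can survive hours, not days
    INFORMATIONAL = 3   # nice to know


@dataclass(frozen=True)
class Dependency:
    name: str
    purpose: str
    criticality: Criticality
    failure_mode: str
    notify: tuple[str, ...]


DEPENDENCIES = [
    Dependency("Stripe", "payments", Criticality.CRITICAL,
               "checkout fails", ("on-call", "support")),
    Dependency("SendGrid", "transactional email", Criticality.IMPORTANT,
               "emails delayed", ("backend-team",)),
    Dependency("Notion", "internal docs", Criticality.INFORMATIONAL,
               "docs unavailable", ()),
]


def at_level(deps: list[Dependency], level: Criticality) -> list[str]:
    """Names of the dependencies at a given criticality level."""
    return [d.name for d in deps if d.criticality is level]


print(at_level(DEPENDENCIES, Criticality.CRITICAL))
```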

Types of Vendor Outages

Not all outages look the same, and the type of outage determines how you should respond.

Full Outage

The service is completely unavailable. API calls return errors. The website is down. Nothing works. These are the easiest to detect and the hardest to work around. The June 2021 Fastly CDN failure was a full outage for affected customers: sites either loaded or they did not [2].

Degraded Performance

The service is technically available but performing poorly. API response times spike from 200ms to 10 seconds. Error rates increase from 0.01% to 5%. Requests intermittently time out. These are insidious because they are harder to detect and harder to confirm. Your application might still technically work, but the user experience is terrible.

Regional Outage

The service is down in specific geographic regions but operational elsewhere. Cloud providers are particularly prone to this because of their multi-region architecture. The December 2021 AWS outage primarily affected us-east-1, so services hosted in other regions were largely unaffected [3]. If you are in the affected region, it is a full outage for you, even if the vendor's global status looks mostly healthy.

Intermittent Failures

The service works sometimes and fails sometimes. These are the most frustrating type of outage to diagnose because the problem comes and goes. You might see 10% of requests failing, or failures that only affect certain API endpoints, or issues that appear for a few minutes and then resolve before appearing again.

Feature-Specific Outage

The service is mostly working, but a specific feature or API endpoint is broken. For example, Stripe's payment processing might work fine while their dashboard or reporting API is down. If you depend on the broken feature, it is an outage for you. If you do not, you might not even notice.
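Some of the distinctions above can be approximated from your own request data. The sketch below classifies a recent window of requests to one vendor. The cutoffs (95% errors for a full outage, 1% for intermittent failures, a 2-second 95th-percentile latency for degradation) are assumptions to tune per vendor, and regional or feature-specific outages would need the same logic applied per region or per endpoint.

```python
def classify_window(samples: list[tuple[bool, float]], slow_ms: float = 2000.0) -> str:
    """Classify recent requests, each given as (succeeded, latency_ms)."""
    if not samples:
        raise ValueError("no samples in window")
    failures = sum(1 for ok, _ in samples if not ok)
    error_rate = failures / len(samples)
    if error_rate >= 0.95:
        return "full outage"
    if error_rate >= 0.01:
        return "intermittent failures"
    # Error rate is under 1%, so at least one request succeeded.
    ok_latencies = sorted(lat for ok, lat in samples if ok)
    p95 = ok_latencies[int(0.95 * (len(ok_latencies) - 1))]
    if p95 > slow_ms:
        return "degraded performance"
    return "healthy"


print(classify_window([(True, 180.0)] * 90 + [(False, 5000.0)] * 10))
```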

Setting Up Monitoring

With your dependency map in hand, you can set up monitoring that covers your actual risk.

What to Monitor

At minimum, monitor every service that would cause a customer-facing impact if it went down. That means your payment processor, your hosting provider, your CDN, your email delivery service, and any API your application calls during normal operation.

Also monitor the tools that would impair your ability to respond to incidents. If your team communicates via Slack and manages code on GitHub, monitor those too. You do not want to discover during an active incident that your communication tool is also down.

Check Frequency

How often you check depends on how quickly you need to know. For critical, customer-facing services, checking every one to five minutes is reasonable. For internal tools, every five to fifteen minutes is usually sufficient.

More frequent checking gives you faster detection but generates more noise if a service has brief blips that resolve on their own. Less frequent checking is quieter but means you might not know about a problem for several minutes after it starts.
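One common way to get fast checks without the noise is to alert only after several consecutive failures. A minimal sketch, assuming a threshold of three (the class name and default are illustrative):

```python
class FailureDebouncer:
    """Signal an alert only when failures persist across consecutive checks."""

    def __init__(self, failures_to_alert: int = 3):
        self.failures_to_alert = failures_to_alert
        self._streak = 0

    def record(self, ok: bool) -> bool:
        """Record one check result.

        Returns True exactly once per incident: on the check where the
        failure streak first reaches the threshold.
        """
        if ok:
            self._streak = 0
            return False
        self._streak += 1
        return self._streak == self.failures_to_alert


debouncer = FailureDebouncer(failures_to_alert=3)
for result in (False, True, False, False, False, False):
    if debouncer.record(result):
        print("alert: vendor check has failed 3 times in a row")
```

With one-minute checks and a threshold of three, a real outage is flagged in about three minutes while a single dropped request is ignored.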

Alert Routing

Not every outage needs to wake someone up at 3 AM. Set up tiered alerting based on severity and business impact.

For your most critical services (payment processing, hosting), alerts should go to the on-call engineer or operations lead immediately, via a high-urgency channel like PagerDuty or a phone call.

For important but not immediately critical services (analytics, internal tools), send alerts to a Slack channel or email. The team will see it when they are online.

For informational monitoring (services you use but could live without briefly), log the outage for later review but do not interrupt anyone.

Route vendor alerts through the same incident management system you use for your own infrastructure. When a vendor outage triggers customer-facing impact, you want it handled with the same urgency and process as an internal outage.
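The three tiers above map naturally onto a lookup table. In this sketch the channel names and the vendor-to-tier assignments are placeholders. Wiring the channels to PagerDuty, Slack, or email is left to whatever incident tooling you already use.

```python
TIER_CHANNELS = {
    "critical": ("pager",),
    "important": ("chat", "email"),
    "informational": ("log",),
}

# Hypothetical assignments; derive yours from the dependency map.
VENDOR_TIERS = {
    "stripe": "critical",
    "aws": "critical",
    "analytics": "important",
    "notion": "informational",
}


def route_alert(vendor: str) -> tuple[str, ...]:
    """Return the channels that should receive an alert for `vendor`."""
    # Unknown vendors default to "important" so that a newly added
    # dependency is never silently dropped.
    tier = VENDOR_TIERS.get(vendor.lower(), "important")
    return TIER_CHANNELS[tier]


print(route_alert("Stripe"))
print(route_alert("some-new-vendor"))
```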

Monitoring Your Own Site Alongside Vendors

Vendor monitoring does not replace monitoring your own systems. You need both. When your application monitoring shows errors, vendor monitoring helps you determine whether the root cause is internal or external. When vendor monitoring shows an outage, your application monitoring tells you whether it is actually affecting your users.

For guidance on monitoring your own uptime, see what is uptime monitoring on our sister site.

Incident Response for Third-Party Outages

When a vendor goes down, your options are limited compared to an internal outage. You cannot fix the problem. You cannot deploy a patch. You are waiting on someone else. But that does not mean there is nothing to do.

Confirm the Outage

Before you spin up an incident response, confirm that the vendor is actually the problem. Check multiple sources: the vendor's status page, your own monitoring, third-party tools, and social media. Rule out the possibility that the issue is on your side (a misconfigured API key, a network issue in your infrastructure, a recent deployment that broke an integration).

Assess the Impact

Determine what parts of your application or business are affected. Is this customer-facing or internal only? Are all customers affected or just a subset? Can customers still use most of your product, or is the core experience broken?

This assessment drives every other decision: how urgently to communicate, who to escalate to, and whether to activate fallback systems.

Activate Fallbacks

If you have fallback systems or redundancies in place (covered in the resilience section below), now is the time to activate them. Switch to a backup payment processor. Route traffic through a secondary CDN. Enable the offline mode in your application.

If you do not have fallbacks, your options are more limited, but you might still be able to gracefully degrade. Disable the broken feature and show a helpful message instead of an error. Queue operations for retry when the vendor comes back. Give users a manual workaround.

Monitor for Resolution

Stay on top of the vendor's status updates. Subscribe to their incident updates if you have not already. When they report resolution, verify it with your own monitoring before declaring the incident over on your end.

Vendor-reported resolution does not always mean things are back to normal. After a major outage, there is often a period of recovery where queues drain, caches repopulate, and performance gradually returns to baseline. Test your own integration before telling your customers everything is fixed.
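One way to make "verify it with your own monitoring" concrete is to require a run of consecutive successful checks before closing the incident. A sketch under assumed defaults (five successes, 30 seconds apart). The check callable is whatever test exercises your own integration.

```python
import time
from typing import Callable


def confirm_recovery(check: Callable[[], bool],
                     required_successes: int = 5,
                     interval_s: float = 30.0,
                     max_attempts: int = 40,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once `check` passes `required_successes` times in a row."""
    streak = 0
    for _ in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # any failure restarts the count
        sleep(interval_s)
    return False
```

Resetting the streak on any failure is deliberate: during the recovery period described above, a service that passes four checks and fails the fifth is not yet back to normal.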

For a step-by-step framework you can use during live incidents, see the vendor outage response playbook.

Communicating Outages to Your Users

How you communicate during a vendor outage matters more than the outage itself. Your customers remember how you handled it long after they forget the details.

Be Proactive

Do not wait for customers to report problems. If you know a vendor outage is affecting your product, tell your customers before they notice. This is the single most impactful thing you can do. A proactive message ("We are aware of an issue affecting payments and are working to resolve it") builds far more trust than a reactive one ("Sorry you experienced that, we had an outage earlier").

Be Honest Without Blaming

Your customers do not need to know your entire vendor architecture, but they deserve honesty. "We are experiencing issues with our payment processing system" is better than "We are investigating reports of intermittent errors in some transaction workflows." The first is clear. The second is corporate evasion.

At the same time, avoid publicly blaming specific vendors by name in customer-facing communications unless the outage is already widely known and attributed. Focus on the impact and your response, not the root cause.

Communication Templates

Having pre-written templates dramatically speeds up your response. Here are the essential ones:

Initial acknowledgment: "We are aware of an issue affecting [specific feature]. Our team is investigating and we will provide updates every [30 minutes/1 hour]. [Feature] may be unavailable or slow during this time."

In-progress update: "The issue affecting [feature] is still ongoing. We have confirmed the cause and are working with our infrastructure provider to resolve it. Estimated time to resolution: [estimate or 'unknown at this time']."

Resolution: "The issue affecting [feature] has been resolved. All systems are operating normally. We apologize for the disruption. We will publish a full postmortem within [timeframe]."

Your Own Status Page

If you do not already have a status page for your own product, build one. It gives you a place to direct customers during incidents and a history of your reliability. Services like Statuspage.io, Instatus, or Cachet make this straightforward. Posting updates to your status page is faster and more scalable than responding to individual support tickets.

SLA Tracking and Vendor Accountability

Most SaaS vendors commit to specific uptime levels in their Service Level Agreements. Tracking whether they actually meet those commitments gives you leverage in vendor negotiations and data for keep-or-replace decisions about your tool stack.

Understanding SLAs

A typical SaaS SLA guarantees 99.9% uptime, which translates to roughly 8 hours and 46 minutes of allowed downtime per year. Some vendors offer 99.99% (52.6 minutes per year) or even 99.999% (5.3 minutes per year) for premium tiers.
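The downtime budgets follow directly from the percentage. Many SLAs are measured per month rather than per year, so the helper below takes the period length as a parameter. The function name is illustrative.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365) -> float:
    """Minutes of downtime permitted by an uptime SLA over the given period."""
    return (1 - sla_percent / 100) * period_days * 24 * 60


for sla in (99.9, 99.99, 99.999):
    per_year = allowed_downtime_minutes(sla)
    per_month = allowed_downtime_minutes(sla, period_days=30)
    print(f"{sla}%: {per_year:.1f} min/year, {per_month:.2f} min per 30-day month")
```

At 99.9%, the monthly budget is only about 43 minutes, so a single one-hour incident can breach a monthly SLA even though it fits comfortably inside the annual figure.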

The important details are usually in the fine print. How does the vendor measure uptime? Does scheduled maintenance count against the SLA? What constitutes a "qualifying outage"? What is the actual remedy if they miss their target? In many cases, the remedy is a service credit worth a small percentage of your monthly bill, not full compensation for your losses.

Tracking Uptime

Your vendor monitoring tool should log every incident it detects: when it started, how long it lasted, and how severe it was. Over time, this gives you an objective record of each vendor's reliability that you can compare against their SLA commitments.

This data is valuable in several scenarios. When negotiating contract renewals, you can show exactly how many outages a vendor had and how they compared to SLA targets. When evaluating whether to switch vendors, you have historical reliability data to compare. When justifying investment in redundancy or fallback systems, you have concrete outage data to support the business case.
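Turning an incident log into a number you can compare against the SLA is straightforward. This sketch assumes the log is a list of non-overlapping (start, end) pairs and counts every logged minute as downtime. A vendor's own SLA definition may exclude scheduled maintenance or partial degradation.

```python
from datetime import datetime, timedelta


def measured_uptime_percent(incidents: list[tuple[datetime, datetime]],
                            period_start: datetime,
                            period_end: datetime) -> float:
    """Uptime percentage over a period, given non-overlapping incident intervals."""
    total_s = (period_end - period_start).total_seconds()
    down_s = 0.0
    for start, end in incidents:
        # Clip each incident to the reporting period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_s += (clipped_end - clipped_start).total_seconds()
    return 100 * (1 - down_s / total_s)


period_start = datetime(2024, 1, 1)
period_end = period_start + timedelta(days=30)
log = [(datetime(2024, 1, 10, 12, 0), datetime(2024, 1, 10, 13, 30))]
print(f"{measured_uptime_percent(log, period_start, period_end):.3f}%")
```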

Requesting Credits

When a vendor misses their SLA, file a credit request. Many organizations skip this because the process feels tedious or the credit amounts seem small. But SLA credits serve a purpose beyond their dollar value. They signal to the vendor that you are tracking their reliability and that continued outages have consequences. Vendors pay more attention to customers who hold them accountable.

Document every outage, including timestamps, duration, impact description, and any supporting evidence from your monitoring. Submit credit requests promptly. Most SLAs have a window (often 30 days) after the incident within which you must file.

Building Resilience

The best incident response is the one that does not require human intervention. Building resilience into your architecture means your systems can survive vendor outages automatically, or at least degrade gracefully.

Redundancy

For your most critical dependencies, consider having a backup. This does not mean paying for two of everything. It means identifying the services where an outage would have the highest impact and maintaining a secondary option you can switch to.

Payment processing is the classic example. If Stripe goes down and you also have a Braintree integration ready, you can switch payment flows and keep processing orders. The switch does not have to be instant or automatic. Even a manual failover that takes 15 minutes is better than being completely dead for two hours.
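In code, a failover of this kind is an ordered list of processors tried in turn. The sketch below does not use the real Stripe or Braintree SDKs. The charge callables and the ProcessorUnavailable exception are hypothetical stand-ins for thin wrappers you would write around each SDK.

```python
from typing import Callable


class ProcessorUnavailable(Exception):
    """Raised by a wrapper when its processor cannot be reached or times out."""


def charge_with_failover(amount_cents: int,
                         processors: list[tuple[str, Callable[[int], str]]]) -> tuple[str, str]:
    """Try each processor in order; return (processor_name, receipt) from the first that works."""
    errors = []
    for name, charge in processors:
        try:
            return name, charge(amount_cents)
        except ProcessorUnavailable as exc:
            # Only availability failures trigger failover. A declined card
            # should surface to the user, not be retried elsewhere.
            errors.append(f"{name}: {exc}")
    raise ProcessorUnavailable("all processors failed: " + "; ".join(errors))
```

A timeout does not prove the first charge failed, so a production version needs idempotency keys or a reconciliation step to avoid charging a customer twice.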

Graceful Degradation

Not every feature needs to work all the time. Design your application so that a dependency failure affects only the feature that depends on it, not your entire product.

If your search provider is down, show a message that search is temporarily unavailable instead of crashing the whole page. If your analytics service is unreachable, skip the tracking call instead of blocking page loads. If your email delivery service is down, queue messages for later delivery instead of failing the user action that triggered the email.
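The email case can be sketched as a small outbox that queues on failure instead of raising. The send callable is a hypothetical wrapper around your email provider's client, assumed to raise ConnectionError when the provider is unreachable. The queue here lives in memory for brevity, whereas a real one would be persisted so messages survive a restart.

```python
from collections import deque
from typing import Callable


class EmailOutbox:
    """Deliver messages immediately when possible, otherwise hold them for later."""

    def __init__(self, send: Callable[[str], None]):
        self._send = send
        self._pending: deque[str] = deque()

    @property
    def pending(self) -> int:
        return len(self._pending)

    def deliver(self, message: str) -> bool:
        """Return True if sent now, False if queued for retry."""
        try:
            self._send(message)
            return True
        except ConnectionError:
            self._pending.append(message)
            return False

    def flush(self) -> int:
        """Retry queued messages in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break
            self._pending.popleft()
            sent += 1
        return sent
```

The user action that triggered the email succeeds either way. Calling flush from a periodic job drains the queue once the vendor recovers.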

Circuit Breakers

Implement circuit breaker patterns in your code for external API calls. When a vendor starts returning errors, the circuit breaker "opens" and stops sending requests, falling back to a cached response or default behavior. This prevents a slow or failing vendor from dragging down your entire application with timeouts and retries.
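A minimal circuit breaker fits in a few dozen lines. This sketch is single-threaded and counts consecutive failures. The threshold of five failures and the 30-second reset are assumptions, and production code would usually reach for an established library and add locking.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: skip the vendor call entirely.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("vendor circuit is open")
            # Reset period elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage looks like breaker.call(fetch_rates, fallback=lambda: cached_rates), where fetch_rates and cached_rates are your own function and data. While the circuit is open, callers get the fallback immediately instead of waiting on a timeout.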

Multi-Region Strategies

If your application serves a global audience, consider distributing your infrastructure across multiple cloud regions. A regional outage at your hosting provider will only affect the region that is down, and you can route traffic to healthy regions until the issue resolves.

This adds complexity, especially around data replication and consistency. But for applications where availability is critical, the tradeoff is worth it.

Caching

Aggressive caching of vendor responses reduces your real-time dependency on those vendors. If your application caches product catalog data from a vendor API, a brief API outage will not affect your users because they are seeing cached data. The cache goes stale eventually, but minutes of staleness is far better than minutes of errors.
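The "stale is better than broken" behavior can be sketched as a cache that refreshes after a time-to-live but falls back to the old value if the refresh fails. The fetch callable stands in for your vendor API call, and the 60-second TTL is an assumption.

```python
import time


class StaleOnErrorCache:
    """Cache vendor responses; serve stale data when a refresh fails."""

    def __init__(self, fetch, ttl_s: float = 60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._store = {}  # key -> (value, fetched_at)

    def get(self, key):
        cached = self._store.get(key)
        now = self._clock()
        if cached is not None and now - cached[1] < self._ttl_s:
            return cached[0]  # still fresh
        try:
            value = self._fetch(key)
        except Exception:
            if cached is not None:
                return cached[0]  # vendor is failing: serve the stale copy
            raise  # nothing cached yet, so the error has to surface
        self._store[key] = (value, now)
        return value
```

This suits data where staleness is harmless, such as a product catalog. It is a poor fit for anything that must be current, such as account balances or permissions.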

Internet and Infrastructure Outages

Some outages are bigger than any single vendor. When core internet infrastructure fails, the impact is widespread and unpredictable.

DNS Failures

DNS translates domain names into IP addresses. When DNS fails, nothing works. The October 2021 Facebook outage began with a backbone configuration change that disconnected Facebook's data centers; its DNS servers then withdrew their BGP route advertisements and became unreachable, taking down Facebook, Instagram, WhatsApp, and Oculus for roughly six hours [5]. Even Facebook's internal tools were affected, reportedly preventing engineers from accessing the systems they needed to fix the problem.

DNS outages can also happen at the provider level. If you use a third-party DNS provider and they have an outage, your domain stops resolving even though your servers are perfectly healthy. This is why some organizations use multiple DNS providers.

For a detailed explanation of how DNS monitoring works and why it matters, see DNS monitoring explained.

CDN Outages

Content delivery networks sit between your servers and your users. When a CDN goes down, your site may become slow or unreachable even though your origin servers are fine. The Fastly outage in June 2021 demonstrated this vividly: sites that depended on Fastly's CDN were completely inaccessible, while their origin servers were operating normally [2].

ISP and Network Issues

Sometimes the problem is not a specific service but the network path between your users and your servers. ISP outages, submarine cable cuts, and BGP route leaks can all cause connectivity issues that look like application outages from the user's perspective.

These are particularly difficult to diagnose because they are often regional and intermittent. Users in one city might have no problems while users in another cannot reach your site at all.

Cloud Provider Outages

When a major cloud provider has a significant outage, the blast radius can be enormous. The December 2021 AWS us-east-1 outage affected a staggering number of services and businesses because so much of the internet's infrastructure concentrates in that region [3].

Cloud provider outages are especially challenging because they often affect multiple layers of your stack simultaneously. Your application servers, your database, your cache, your queue, and your monitoring might all be down at the same time if they are in the same region.

Do not host your monitoring in the same region or on the same provider as your production infrastructure. If your monitoring goes down with your application, you will not know about the outage until someone calls you. Use an external monitoring service or run your monitoring infrastructure in a separate cloud region.

SSL and Certificate Issues

Expired or misconfigured SSL certificates can make your site appear down to users even when your servers are running perfectly. Browsers will show security warnings or refuse to connect entirely. Certificate transparency monitoring and automated renewal help prevent these issues, but they still happen. For more on keeping SSL health in check, see SSL monitoring explained.
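A basic expiry check needs only the standard library. The sketch below reads the certificate's notAfter field and converts it to days remaining. The function names and any warning threshold you apply to the result are assumptions.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan 31 00:00:00 2030 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Days remaining before the certificate expires (negative if already expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400


# Example with a fixed clock so the result does not depend on today's date.
print(days_until_expiry("Jan 31 00:00:00 2030 GMT",
                        now=datetime(2030, 1, 1, tzinfo=timezone.utc)))
```

Because fetch_not_after validates the certificate during the handshake, an already expired or misconfigured certificate raises ssl.SSLCertVerificationError instead of returning a date. Catch that error and treat it as an alert in its own right.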

Putting It All Together

Vendor monitoring is not a single tool or practice. It is a combination of visibility, preparation, and response capability.

Start with your dependency map. Know what you rely on and how critical each dependency is. Set up automated monitoring so you learn about outages quickly. Build response plans so your team knows what to do when alerts fire. Communicate proactively with your customers so outages become minor inconveniences rather than trust-breaking events. Track reliability data so you can hold vendors accountable and make informed decisions about your architecture.

The goal is not to prevent vendor outages. You cannot. The goal is to minimize their impact on your business and your customers. The organizations that do this well are the ones that invested in monitoring, built resilience into their systems, and practiced their response before the outage happened.

You do not need to do everything at once. Start by monitoring your most critical vendors. Build from there. Every step you take reduces your exposure and gives your team more time to respond when something goes wrong.

References

[1] Productiv, "2023 State of SaaS," productiv.com/blog/state-of-saas-2023. Reports the average enterprise uses 371 SaaS applications.

[2] Fastly, "Summary of June 8 outage," fastly.com/blog/summary-of-june-8-outage, June 2021.

[3] AWS, "Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region," aws.amazon.com/message/12721/, December 2021.

[4] Gartner, "The Cost of Downtime," referenced in multiple Gartner research publications and widely cited across the industry as $5,600 per minute average.

[5] Facebook Engineering, "More details about the October 4 outage," engineering.fb.com/2021/10/05/networking-traffic/outage-details/, October 2021.