The Slack message came in at 11:47 PM: "We're down. Everything's down. The demo is in 9 hours."
I was consulting with a rapidly growing HR tech company preparing for their SOC 2 audit. They'd been laser-focused on security controls—firewalls, encryption, access management. But they'd completely overlooked availability. Now, the night before their biggest demo of the year—a Fortune 100 prospect—their entire platform was inaccessible.
The CEO asked me a question I've heard dozens of times: "Isn't SOC 2 just about security? Why does uptime matter?"
That's when I had to deliver some hard truth: In SOC 2, availability isn't just a nice-to-have. For service organizations, it's often the criterion that determines whether you keep or lose customers.
What SOC 2 Availability Really Means (And Why Most People Get It Wrong)
After spending fifteen years helping organizations achieve SOC 2 compliance, I've learned that most people fundamentally misunderstand the Availability criteria. They think it's about having good uptime numbers. That's like saying cooking is about having ingredients.
Availability in SOC 2 is about demonstrating that you have systematic processes to ensure your systems are available for operation and use as committed or agreed upon.
Let me break down what that actually means:
"SOC 2 Availability isn't asking 'Are you up?' It's asking 'Can you prove you've designed your systems to stay up, can detect when they go down, and can recover when they fail?'"
The Three Pillars of SOC 2 Availability
Through countless audits, I've learned that SOC 2 availability rests on three fundamental pillars:
| Pillar | What It Means | What Auditors Look For |
|---|---|---|
| Design | Your systems are architected for availability | Redundancy, load balancing, failover mechanisms, capacity planning |
| Monitoring | You can detect availability issues immediately | Real-time monitoring, alerting systems, performance metrics, incident tracking |
| Response | You can restore service when issues occur | Incident response procedures, backup systems, recovery protocols, communication plans |
I learned this framework the hard way. In 2019, I worked with a SaaS company that had 99.97% uptime—phenomenal numbers. But they failed their SOC 2 audit on availability.
Why? They couldn't demonstrate how they maintained that uptime. No documented procedures. No capacity planning. No formal incident response. They were just lucky. And in SOC 2, luck doesn't count.
The Real Cost of Ignoring Availability
Let me share a story that still makes me wince.
A fintech startup I advised in 2021 was crushing it. They had brilliant technology, happy customers, and were closing deals left and right. They completed a SOC 2 Type I examination scoped solely to the Security criteria because their customers didn't explicitly ask for Availability.
Then they landed a major bank as a customer—$3.2 million annual contract. The bank required SOC 2 Type II with Availability criteria. No problem, they thought. We're always up.
Six months into the audit period, disaster struck. A routine database upgrade went wrong. Their system was down for 14 hours. When I reviewed their response, I found:
- No documented rollback procedures
- No communication plan (customers learned about the outage from Twitter)
- No redundancy (single point of failure in their database)
- No tested backups (they had backups, but had never verified they worked)
The bank didn't renew. But worse, word spread. Three other enterprise prospects in their pipeline demanded to see their SOC 2 report including Availability criteria. They couldn't provide it.
Lost revenue: over $8 million. All because they treated availability as an afterthought.
"Your security can be perfect, but if customers can't access your service, they'll find a competitor who's just 'good enough' and actually available."
Understanding the SOC 2 Availability Criteria: A Deep Dive
Let me walk you through what SOC 2 actually requires for Availability. This is based on the AICPA Trust Services Criteria, which I've probably read about 200 times at this point.
Common Criteria (CC) That Impact Availability
First, understand that availability builds on the Common Criteria that apply to all SOC 2 audits:
| Common Criteria | Availability Application | Real-World Example |
|---|---|---|
| CC1: Control Environment | Management commitment to availability | SLA commitments in contracts, uptime as a company KPI |
| CC2: Communication | How availability targets are communicated | Status pages, customer notifications, internal escalation procedures |
| CC3: Risk Assessment | Identifying availability threats | Capacity planning, load testing, failure mode analysis |
| CC4: Monitoring | Tracking system performance | APM tools, uptime monitoring, performance dashboards |
| CC5: Control Activities | Processes ensuring availability | Change management, incident response, backup procedures |
Availability-Specific Criteria (A1)
The availability criteria specifically focus on:
A1.1: The entity maintains, monitors, and evaluates current processing capacity and use of system components to manage capacity demand and to enable the implementation of additional capacity to help meet its objectives.
In plain English: You need to prove you're not flying blind. You know your system's limits, monitor how close you're getting to them, and have a plan to scale before you hit the wall.
I worked with an e-commerce platform that learned this lesson during Black Friday 2020. They had no capacity monitoring. Their infrastructure scaled automatically, but they hit their cloud provider's account limits at 2 PM on the busiest shopping day of the year. Site down for 4 hours. Revenue lost: $1.2 million.
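In code terms, A1.1 is a headroom projection: given your growth rate and provisioning lead time, will you cross a safety margin before new capacity can land? A toy check to make that concrete (the 80% margin, the compound-growth model, and the function name are all illustrative assumptions, not part of the criteria):

```python
# Project load forward over the provisioning lead time and compare it
# against a safety margin below the hard limit. All numbers illustrative.

def needs_capacity(current_load: float, hard_limit: float,
                   weekly_growth_pct: float, lead_time_weeks: int = 4,
                   safety_margin: float = 0.8) -> bool:
    """True if projected load would exceed the margin before new capacity arrives."""
    projected = current_load * (1 + weekly_growth_pct / 100) ** lead_time_weeks
    return projected > safety_margin * hard_limit

# At 5% weekly growth and a 4-week lead time, 700 of 1,000 units today
# breaches the 80% margin before help can arrive; 600 does not.
```

A check like this, run on a schedule and wired to an alert, is exactly the kind of evidence an auditor can point to when asking how you "manage capacity demand."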
A1.2: The entity authorizes, designs, develops or acquires, implements, operates, approves, maintains, and monitors environmental protections, software, data backup processes, and recovery infrastructure to meet its objectives.
Translation: You've thought about what could go wrong (power failures, hardware failures, disasters) and have actual plans—not just hopes—to deal with them.
A1.3: The entity tests recovery plan procedures supporting system recovery to meet its objectives.
This is the one that kills most organizations: You don't just have backup and recovery plans. You test them. Regularly.
A healthcare SaaS company I consulted with had beautiful disaster recovery documentation. Hundreds of pages. Never tested. When they finally ran a DR test for their SOC 2 audit, they discovered their backups were corrupted. They'd been backing up corrupted data for seven months.
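Corruption like that is cheap to catch: record a checksum when the backup is taken, then verify it after every test restore. A minimal sketch of the verification half (the paths and the source of the recorded checksum are hypothetical; real backup tools often provide this built in):

```python
# Verify a restored file against a checksum recorded at backup time.
# Catches silent corruption like the seven-month case described above.
import hashlib

def sha256_of(path: str) -> str:
    """Stream the file in 1 MB chunks so large backups don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def restore_is_valid(restored_path: str, recorded_sha256: str) -> bool:
    """True only if the restored file matches the checksum taken at backup time."""
    return sha256_of(restored_path) == recorded_sha256
```

Note that a checksum only proves the bytes survived; A1.3 still expects you to prove the restored system actually runs, which is why the test restore itself matters.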
Building an Availability Program That Actually Works
Let me share the framework I've used with over 30 companies to build SOC 2-compliant availability programs. This isn't theory—this is what actually gets you through audits and keeps your systems running.
Step 1: Define Your Availability Commitments
You can't manage what you don't measure, and you can't measure what you don't define.
Here's a table of common availability commitments and what they actually mean:
| Uptime Commitment | Allowable Downtime | Typical Use Case | What It Takes |
|---|---|---|---|
| 99.9% ("Three Nines") | 8.76 hours/year<br>43.8 minutes/month | Small business services, internal tools | Basic redundancy, manual failover, business hours support |
| 99.95% | 4.38 hours/year<br>21.9 minutes/month | Professional SaaS, B2B platforms | Active-passive redundancy, automated monitoring, 24/7 on-call |
| 99.99% ("Four Nines") | 52.56 minutes/year<br>4.38 minutes/month | Enterprise SaaS, financial services | Active-active redundancy, automated failover, 24/7 NOC |
| 99.999% ("Five Nines") | 5.26 minutes/year<br>26 seconds/month | Critical infrastructure, payment processing | Multi-region active-active, chaos engineering, massive investment |
I worked with a startup in 2022 that promised 99.99% uptime to all customers. Sounds great, right? Problem: they had no idea what that actually meant operationally.
When we calculated their architecture requirements, they were shocked:
- They needed multi-region deployment ($8,000/month additional infrastructure costs)
- 24/7 on-call rotation (required hiring 2 additional engineers)
- Automated failover systems (3 months of development time)
- Extensive monitoring and alerting infrastructure
Their actual architecture could realistically deliver 99.9% uptime. They had to either upgrade their infrastructure or revise their commitments. They chose to be honest: 99.9% for standard tier, 99.95% for premium customers paying 40% more.
Lesson learned: Your commitments need to match your capabilities. SOC 2 auditors will verify both.
Step 2: Architect for Availability
I've seen brilliant architectures and I've seen disasters. Here's what actually matters:
Eliminate Single Points of Failure
Every system has a weakest link. Your job is to find and eliminate them.
A common architecture I see that fails availability requirements:
```
[Load Balancer] → [Single Database] → [Application Servers (multiple)]
        ↓
[Single Backup System]
```
The application servers are redundant, but the database is a single point of failure. When it goes down (not if, when), everything stops.
An architecture that passes availability requirements:
```
[Load Balancer - Primary] ←→ [Load Balancer - Standby]
        ↓
[Application Server Cluster] (3+ nodes across availability zones)
        ↓
[Database Primary] ←→ [Database Replica] (automatic failover)
        ↓
[Backup System - Primary] → [Backup System - Secondary] (different region)
```
Real example: I helped a marketing automation company redesign their architecture. Their original setup had 12 single points of failure. After our work, they had zero. Their infrastructure costs increased by 34%, but their unplanned downtime decreased by 91%.
Implement Proper Monitoring
Here's what SOC 2 auditors want to see in your monitoring:
| Monitoring Category | What to Monitor | Alert Thresholds | Example Tools |
|---|---|---|---|
| Infrastructure | Server CPU, memory, disk, network | >80% sustained for 5 minutes | Datadog, New Relic, CloudWatch |
| Application | Response time, error rates, transaction volume | >500ms p95 latency, >1% error rate | APM tools, custom dashboards |
| Database | Query performance, connection pool, replication lag | >100ms query time, >30s replication lag | Database-specific monitoring |
| External Dependencies | API availability, response times | >5s timeout, >2% failure rate | Pingdom, UptimeRobot, synthetic monitoring |
| User Experience | Real user monitoring, transaction success | Baseline deviation >20% | RUM tools, session replay |
"If you're not monitoring it, you don't know it's broken. If you don't know it's broken, you can't fix it. If you can't fix it, you're not meeting availability requirements."
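Under the hood, a monitoring pipeline reduces to rules like the thresholds in the table. A minimal evaluation sketch, with metric names and cutoffs borrowed from that table (the rule set itself is illustrative, not taken from any particular tool):

```python
# Evaluate a metrics sample against alert thresholds like those in the table.
# Metric names and rules are illustrative placeholders.

RULES = {
    "cpu_pct":           lambda v: v > 80,    # sustained infrastructure load
    "p95_latency_ms":    lambda v: v > 500,   # application under stress
    "error_rate_pct":    lambda v: v > 1.0,   # failed-request ratio
    "replication_lag_s": lambda v: v > 30,    # database replica falling behind
}

def evaluate(sample: dict) -> list:
    """Return the names of metrics in the sample that breach their threshold."""
    return [name for name, rule in RULES.items()
            if name in sample and rule(sample[name])]

# CPU and error rate breach their thresholds; p95 latency is fine
breaches = evaluate({"cpu_pct": 91, "p95_latency_ms": 340, "error_rate_pct": 2.4})
```

The point for SOC 2 is that thresholds are explicit and reviewable, rather than living in one engineer's head.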
Step 3: Document Everything (Yes, Everything)
I know, I know. Engineers hate documentation. I get it. But here's the truth: In SOC 2 audits, if it's not documented, it doesn't exist.
I've seen companies with sophisticated availability practices fail audits because they couldn't produce documentation. Here's what you actually need:
Essential Availability Documentation
| Document Type | What It Contains | Update Frequency | Audit Importance |
|---|---|---|---|
| System Architecture Diagrams | Infrastructure components, data flows, redundancy | Quarterly or with major changes | Critical |
| Capacity Management Plan | Current capacity, growth projections, scaling triggers | Monthly | High |
| Incident Response Procedures | Detection, escalation, response, recovery steps | Semi-annually | Critical |
| Disaster Recovery Plan | Recovery procedures, RTO/RPO targets, contact lists | Quarterly | Critical |
| Change Management Procedures | How changes are tested, approved, deployed, rolled back | Annually | High |
| Monitoring and Alerting Configuration | What's monitored, alert thresholds, escalation paths | Monthly | High |
| Service Level Agreements | Uptime commitments, response times, remediation | Per contract | Critical |
| Availability Metrics Reports | Actual uptime, incidents, MTTR, trends | Monthly | Critical |
Step 4: Test Your Recovery Procedures
This is where companies fail most often. They have beautiful disaster recovery plans that have never been tested.
I worked with a financial services company in 2023 that was convinced they had robust DR procedures. We scheduled a DR test. Here's what we discovered:
- Their backup restoration process documentation was 18 months old and no longer matched their current infrastructure
- Three critical systems weren't included in the backup schedule
- The backup restoration took 14 hours (their RTO was 4 hours)
- Five key personnel who were supposed to execute the DR plan had left the company
- Their failover database wasn't receiving updates (6 weeks of data would have been lost)
We found all this during a test. Imagine if it had been a real disaster.
Here's my testing framework that actually works:
| Test Type | Frequency | What You're Testing | Duration | Disruption Level |
|---|---|---|---|---|
| Backup Restoration | Monthly | Can you restore from backups? | 2-4 hours | None (test environment) |
| Failover Testing | Quarterly | Do your redundant systems actually work? | 4-8 hours | Minimal (planned maintenance window) |
| Disaster Recovery Drill | Semi-annually | Can you recover from a complete failure? | 1-2 days | Significant (but scheduled) |
| Tabletop Exercise | Quarterly | Does your team know what to do? | 2-3 hours | None |
| Chaos Engineering | Monthly (for mature orgs) | How does your system handle unexpected failures? | Ongoing | Controlled |
The Metrics That Actually Matter
SOC 2 auditors don't just want to see that you're available. They want evidence that you're managing availability. That means metrics.
Here are the metrics I track for every client:
Uptime and Availability Metrics
| Metric | How to Calculate | Target (varies by commitment) | Why It Matters |
|---|---|---|---|
| Uptime Percentage | (Total Time - Downtime) / Total Time × 100 | 99.9% - 99.99% | Primary availability indicator |
| Mean Time Between Failures (MTBF) | Total Uptime / Number of Failures | >720 hours (30 days) | Indicates system reliability |
| Mean Time To Detect (MTTD) | Time from failure to detection | <5 minutes | Shows monitoring effectiveness |
| Mean Time To Respond | Time from detection to response initiated | <15 minutes | Shows team readiness |
| Mean Time To Recover (MTTR) | Time from failure to full recovery | <1 hour for critical systems | Overall recovery capability |
| Planned vs Unplanned Downtime | Ratio of scheduled maintenance to incidents | >80% planned | Indicates operational maturity |
Performance Metrics
| Metric | Measurement | Target | Impact on Availability |
|---|---|---|---|
| Response Time (p50) | Median API/page response time | <200ms | User experience baseline |
| Response Time (p95) | 95th percentile response time | <500ms | Indicates system stress |
| Response Time (p99) | 99th percentile response time | <1000ms | Catches edge cases |
| Error Rate | Failed requests / Total requests | <0.1% | Direct availability indicator |
| Throughput | Requests per second | Varies | Capacity planning |
Real story: A client was proudly reporting 99.97% uptime. But their p95 response time was 8 seconds. Technically available? Yes. Actually usable? Barely. During their SOC 2 audit, the auditor questioned whether slow-but-technically-running systems met the spirit of their availability commitments. We ended up incorporating performance thresholds into their availability calculations.
Building an Incident Response Program That Auditors Love
After reviewing hundreds of incident response procedures, I've learned what works and what doesn't.
The Incident Severity Matrix
Every organization needs a clear way to classify incidents. Here's the framework I use:
| Severity | Definition | Response Time | Escalation | Example |
|---|---|---|---|---|
| SEV-1 (Critical) | Complete service outage or data breach | <5 minutes | Immediate - all hands | Full platform down, payment processing stopped |
| SEV-2 (High) | Major functionality impaired, significant user impact | <30 minutes | On-call team + management | Single major feature down, 25%+ users affected |
| SEV-3 (Medium) | Partial functionality impaired, moderate user impact | <2 hours | On-call team | Performance degradation, minor feature issue |
| SEV-4 (Low) | Minor issue, minimal user impact | <8 hours (business hours) | Assigned engineer | UI glitch, single user issue |
The Incident Response Process That Actually Works
I've seen overcomplicated incident response procedures that nobody follows under pressure. Here's what actually works:
The 5-Step Framework:
1. Detect (automated monitoring triggers an alert)
2. Assess (on-call engineer determines severity)
3. Respond (execute the response procedures for that severity)
4. Recover (restore normal operations)
5. Review (post-incident analysis and improvement)
Real example: A SaaS company I worked with had a 47-page incident response manual. Nobody read it. When incidents happened, people panicked and improvised.
We simplified it to a single-page quick reference guide with clear decision trees and contact information. We also created severity-specific playbooks. MTTR dropped from 3.2 hours to 42 minutes within two months.
The Availability Tools Stack
Let me share the tools that actually help you meet SOC 2 availability requirements:
Essential Tools by Category
| Category | Purpose | Example Tools | Approximate Cost |
|---|---|---|---|
| Infrastructure Monitoring | Server, network, cloud resource monitoring | Datadog, New Relic, CloudWatch | $200-$2,000/month |
| Application Performance | Code-level performance tracking | New Relic APM, AppDynamics, Dynatrace | $300-$3,000/month |
| Uptime Monitoring | External availability checks | Pingdom, UptimeRobot, StatusCake | $15-$200/month |
| Log Management | Centralized logging and analysis | Splunk, ELK Stack, Datadog Logs | $500-$5,000/month |
| Incident Management | Alert routing, on-call scheduling, coordination | PagerDuty, Opsgenie, VictorOps | $30-$100/user/month |
| Status Page | Customer communication during incidents | Atlassian Statuspage, Status.io | $30-$300/month |
| Load Testing | Capacity validation | LoadRunner, JMeter, k6, BlazeMeter | $0-$500/month |
| Backup and DR | Data protection and recovery | Veeam, AWS Backup, Backblaze | Varies widely |
"Don't choose tools because they're popular. Choose them because they'll actually help you detect problems faster and recover quicker. Your auditor doesn't care about your tech stack—they care about whether it works."
Common Availability Pitfalls (And How to Avoid Them)
After fifteen years, I've seen the same mistakes over and over. Here are the big ones:
Pitfall #1: Confusing Availability with Uptime
The mistake: "We're up 99.9% of the time, so we meet availability requirements."
The reality: SOC 2 cares about your process for maintaining availability, not just the outcome.
I audited a company with 99.95% uptime that failed availability criteria because they couldn't explain how they achieved it. They had no documented procedures, no capacity planning, no testing. They were just lucky.
Pitfall #2: No Capacity Planning
The mistake: "We'll scale when we need to."
The reality: By the time you realize you need to scale, it's too late.
A real client story: B2B SaaS company, steady growth, no capacity monitoring. They landed a huge enterprise client that tripled their user base overnight. Their database couldn't handle the load. System became unusable for 16 hours while they frantically upgraded infrastructure.
Lost revenue: $420,000 (existing customers churned due to performance issues). The enterprise client walked away.
Pitfall #3: Untested Backup and Recovery
The mistake: "We have backups, so we're fine."
The reality: Untested backups are just expensive storage costs.
A healthcare company discovered during their SOC 2 audit that while they had been backing up data for 2 years, they'd never tested restoration. When we tested, we found:
- 40% of backups were corrupted
- Restoration process took 3x longer than documented
- Some critical systems weren't backed up at all
Pitfall #4: No Communication Plan
The mistake: "We'll figure out communication when something goes wrong."
The reality: During an outage, you don't have time to figure out who to notify and how.
I watched a company lose three major customers during a 4-hour outage—not because of the outage itself, but because they had no communication plan. Customers found out about the outage from Twitter before receiving any official communication. Trust destroyed.
Pitfall #5: Treating DR as a Project, Not a Program
The mistake: "We built a disaster recovery plan, we're done."
The reality: DR plans expire faster than milk if not maintained.
A financial services company had a beautiful DR plan from 2019. In 2023, when we tested it:
- 60% of IP addresses had changed
- Half the documented personnel had left the company
- Several applications in the DR plan had been retired
- New critical systems weren't included
Building a Culture of Availability
Here's something I've learned: Technology is only 30% of the availability equation. Process is 30%. Culture is 40%.
The most available systems I've seen aren't at companies with the best technology. They're at companies where everyone cares about availability.
How to Build That Culture
1. Make Availability Visible
Install TVs in your office showing real-time availability dashboards. Include uptime metrics in company all-hands. Celebrate availability milestones.
A client implemented a "days without customer-impacting incident" counter in their office. It created healthy pressure to maintain availability and celebrate when they hit milestones.
2. Blameless Post-Mortems
When incidents happen (and they will), focus on learning, not blaming.
I worked with a company that punished engineers for causing outages. Result: engineers started hiding problems, leading to bigger issues. When they switched to blameless post-mortems focused on system improvement, incident frequency dropped 60% because people felt safe raising concerns.
3. Include Availability in Performance Reviews
What gets measured gets managed. If availability isn't part of how you evaluate team performance, it won't be a priority.
4. Invest in Training
Every engineer should understand your availability architecture, monitoring systems, and incident response procedures.
A SaaS company I advised runs quarterly "game days" where they simulate outages and have different team members practice incident response. Result: their MTTR dropped from 2.1 hours to 34 minutes.
The ROI of Availability
Let me talk about money, because that's what gets CFO attention.
A client calculated their true cost of downtime:
| Cost Category | Calculation | Annual Impact |
|---|---|---|
| Direct Revenue Loss | $50,000/hour average transaction volume | $438,000 (8.76 hours @ 99.9%) |
| SLA Credits | 10% monthly fee for missing 99.9% | $120,000 |
| Customer Churn | 15% churn rate after major outages | $890,000 |
| Support Costs | Additional support during/after outages | $67,000 |
| Brand Damage | Estimated impact on new sales | $340,000 |
| Engineering Time | Incident response and recovery | $180,000 |
| **Total Annual Cost** | | **$2,035,000** |
They invested $420,000 to improve availability from 99.9% to 99.95% (reducing annual downtime from 8.76 hours to 4.38 hours).
ROI: 385% in the first year.
"Availability isn't a cost center. It's a profit protector. Every hour of downtime prevented is money in the bank and trust in the brand."
Your Availability Roadmap
If you're starting from scratch, here's your 6-month plan:
Month 1: Assessment and Planning
- Document current architecture
- Identify single points of failure
- Define availability commitments
- Assess current monitoring capabilities
- Calculate current actual uptime

Month 2: Quick Wins
- Implement basic uptime monitoring
- Set up incident alert routing
- Create incident severity definitions
- Document current recovery procedures
- Establish change management process

Month 3: Architecture Improvements
- Address critical single points of failure
- Implement redundancy for critical components
- Set up automated backups
- Configure automated failover (where possible)
- Implement load balancing

Month 4: Monitoring and Alerting
- Deploy comprehensive monitoring
- Configure meaningful alerts
- Set up centralized logging
- Implement performance tracking
- Create availability dashboards

Month 5: Testing and Documentation
- Conduct first backup restoration test
- Run failover testing
- Document all procedures
- Create incident response playbooks
- Update all architecture diagrams

Month 6: Training and Refinement
- Train team on incident response
- Conduct tabletop exercises
- Review and refine procedures
- Establish ongoing testing schedule
- Prepare for SOC 2 audit
Final Thoughts: Availability as Competitive Advantage
I started this article with a midnight crisis—a company going down before their biggest demo. Let me tell you how it ended.
We got them back online in 3.5 hours. They made the demo (barely). But they lost the deal. The prospect had googled them during the outage and found a "We're down again" tweet from a frustrated customer.
That company learned an expensive lesson: in today's always-on world, availability isn't just a technical requirement—it's a business imperative.
But here's the good news: companies that treat availability as a core competency don't just pass SOC 2 audits. They win customers, retain them longer, and command premium pricing.
One of my clients includes their real-time uptime stats in their sales presentations. They show prospects their 99.97% uptime over the past 12 months, backed by their SOC 2 report. Their close rate for enterprise deals is 43% higher than industry average.
Availability isn't about perfection. It's about preparation, process, and continuous improvement.
Your systems will fail. That's not a question of if, but when. The question is: when they do fail, will you have the systems, processes, and culture in place to detect quickly, respond effectively, and recover gracefully?
That's what SOC 2 Availability criteria really measures. And that's what separates thriving companies from cautionary tales.