I remember the exact moment I learned the hard way about the importance of detection capabilities. It was 2017, and I was three months into a consulting engagement with a pharmaceutical company. During a routine review, we discovered evidence of unauthorized access that had been happening for eleven months. Eleven months! The attackers had been exfiltrating research data, and nobody knew because, quite simply, nobody was looking.
The CISO went pale. "But we have a firewall," he said. "And antivirus. How did this happen?"
"You had walls," I told him, "but no security guards watching them."
That conversation changed how I approach security architecture forever. After fifteen years in this field, I've learned that prevention without detection is just wishful thinking. The NIST Cybersecurity Framework's Detect function isn't just one of five core functions—it's often the difference between a contained incident and a catastrophic breach.
Understanding the NIST CSF Detect Function: More Than Just Monitoring
Let me be blunt: most organizations are terrible at detection. They spend 80% of their security budget on prevention and maybe 10% on detection. Then they wonder why breaches go undetected for an average of 207 days (according to the 2024 IBM Cost of a Data Breach Report).
The NIST Cybersecurity Framework Detect function addresses this critical gap. It's built on a simple premise: you can't stop every attack, but you can detect and respond to them before they cause catastrophic damage.
"Prevention is ideal, but detection is essential. You can survive a detected breach. You might not survive an undetected one."
The Three Detect Categories That Matter
The NIST CSF breaks the Detect function into three main categories. I've implemented each of these dozens of times, and here's what I've learned:
NIST Category | What It Means | Why It Matters | Real Impact |
|---|---|---|---|
Anomalies and Events (DE.AE) | Detecting unusual activity and potential security incidents | Finds threats that bypass preventive controls | Average detection time: 24 hours vs 207 days |
Security Continuous Monitoring (DE.CM) | Ongoing observation of networks, systems, and data | Provides real-time visibility into security posture | 73% faster incident response |
Detection Processes (DE.DP) | Procedures and roles for detection activities | Ensures detection happens consistently | 89% reduction in false positives |
Anomalies and Events (DE.AE): Teaching Systems to Notice What's Wrong
In 2019, I worked with a financial services company that was convinced they had solid detection capabilities. They had a SIEM (Security Information and Event Management system) that collected logs from everything. Millions of events per day.
The problem? Nobody was actually analyzing them. The SIEM had become a very expensive log storage system.
During my assessment, I asked their security analyst to show me alerts from the past week. He pulled up a dashboard showing 14,872 alerts. I asked him how many he'd investigated.
"Honestly?" he said. "Maybe twenty. The rest are probably false positives."
Probably.
This is the challenge with anomaly detection: it's not about collecting data—it's about understanding what matters.
The Five Sub-Categories of Anomaly Detection That Actually Work
Here's how I implement DE.AE across organizations, based on what actually produces results:
Sub-Category | Focus Area | Implementation Priority | Common Pitfall |
|---|---|---|---|
DE.AE-1 | Establish baseline of network operations | HIGH - Foundation for everything else | Baselines go stale; update quarterly |
DE.AE-2 | Detect potentially malicious events | HIGH - Core detection capability | Too many false positives overwhelm teams |
DE.AE-3 | Collect and correlate event data | CRITICAL - Can't detect without data | Collect everything, analyze nothing |
DE.AE-4 | Determine impact of detected events | MEDIUM - Risk-based prioritization | Treat all alerts equally (wrong!) |
DE.AE-5 | Define alert thresholds | CRITICAL - Signal vs noise | Set once, never adjust (disaster) |
DE.AE-1: Establishing Behavioral Baselines (Or: Learning What Normal Looks Like)
Here's something nobody tells you: you can't detect anomalies until you know what normal looks like.
I worked with a healthcare provider in 2021 that kept getting alerts about "unusual database access." Every. Single. Day. Hundreds of alerts. The security team had become numb to them.
When we dug in, we discovered that their baseline was established during a holiday weekend when almost nobody was working. So "normal" meant 5% of actual normal activity. Everything else looked anomalous.
We spent two weeks establishing proper baselines:
Network traffic patterns during business hours vs off-hours
Typical data access patterns for different user roles
Standard authentication patterns (failed attempts, location, timing)
Normal system behavior (CPU, memory, disk usage)
Typical user behavior (applications accessed, data volumes, work patterns)
The impact was immediate. Alert volume dropped 87%. But here's the kicker: we actually detected MORE real threats because analysts could finally focus on genuine anomalies.
"A baseline built on a quiet weekend is like taking someone's temperature while they're sleeping and declaring them hypothermic when they wake up and start moving around."
Practical Baseline Implementation: What I Do Every Time
Here's my standard approach for establishing meaningful baselines (a short computation sketch follows the checklist):
Week 1-2: Data Collection
Collect data covering complete business cycles
Include both busy and slow periods
Capture seasonal variations if possible
Document any known anomalous events during collection
Week 3-4: Analysis and Refinement
Identify patterns and outliers
Segment by time of day, day of week, business unit
Account for legitimate variations
Remove actual incidents from baseline data
Week 5-6: Validation and Tuning
Test baselines against known good and bad activity
Adjust thresholds to minimize false positives
Document exceptions and edge cases
Train team on what baselines mean
Ongoing: Maintenance
Review baselines quarterly (minimum)
Update after major business changes
Track baseline drift over time
Document all baseline adjustments
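To make the Week 3-4 segmentation concrete, here's a minimal sketch of a per-weekday, per-hour baseline built from raw event timestamps. It assumes you can export event times from your logging platform as ISO 8601 strings; the three-sigma cutoff and the field handling are illustrative, not tied to any particular product.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

def build_baseline(event_timestamps):
    """Bucket event counts by (weekday, hour) and compute mean/std per bucket."""
    counts = defaultdict(lambda: defaultdict(int))  # (weekday, hour) -> {date: count}
    for ts in event_timestamps:
        dt = datetime.fromisoformat(ts)
        counts[(dt.weekday(), dt.hour)][dt.date()] += 1
    return {
        bucket: (mean(per_day.values()), pstdev(per_day.values()))
        for bucket, per_day in counts.items()
    }

def is_anomalous(baseline, when, observed_count, sigmas=3.0):
    """Flag a count that exceeds its bucket's mean by more than `sigmas` standard deviations."""
    avg, std = baseline.get((when.weekday(), when.hour), (0.0, 0.0))
    return observed_count > avg + sigmas * max(std, 1.0)
```

The point isn't the math; it's that "normal" is computed per time slice, so Tuesday at 10 AM is never compared against Sunday at 3 AM.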
DE.AE-2: Detecting Potentially Malicious Events (The Art of Seeing Threats)
Let me share a detection win that still makes me smile.
In 2020, I implemented detection controls for a software company. Three weeks after going live, our SIEM flagged something subtle: a service account authenticating from two different countries within 45 seconds. Individually, neither authentication was suspicious. Together? Impossible without credential theft.
We investigated immediately. Turned out a developer's laptop had been compromised, and an attacker had extracted service account credentials. The attacker was in Singapore; the legitimate automated process was in AWS us-east-1. The near-simultaneous logins from different geolocations triggered our correlation rules.
We contained the breach within 3 hours. The attacker had accessed exactly one internal system before we cut them off. Total damage: minimal. Total cost: about $15,000 in incident response.
Compare that to the $4.88 million average breach cost. That's a return of more than 300 to 1, roughly a 32,000% ROI.
Not bad for a "potentially malicious event" detection.
The Detection Use Cases That Actually Catch Threats
Based on my experience implementing detection programs, these are the use cases that consistently identify real threats:
Detection Category | What to Monitor | Why Attackers Can't Hide It | Example Alert Logic |
|---|---|---|---|
Impossible Travel | User authentication from different locations | Physical laws of geography | Login from NYC, then London 30 minutes later |
Privilege Escalation | Changes to user permissions | Need elevated access to accomplish goals | Standard user account granted admin rights |
After-Hours Access | Activity during unusual times | Off-hours = less detection risk (they think) | Database access at 3 AM by user who works 9-5 |
Data Exfiltration | Large outbound data transfers | Need to steal data to monetize attack | 50GB uploaded to unknown cloud storage |
Lateral Movement | System-to-system access patterns | Need to explore network to find valuable data | Web server initiating SMB connections to databases |
Failed Authentication Spikes | Multiple failed login attempts | Credential stuffing and brute force attacks | 500 failed logins in 10 minutes |
New Admin Accounts | Creation of privileged accounts | Persistence mechanism for long-term access | New domain admin created at 2 AM |
Process Anomalies | Unexpected process execution | Malware needs to run to be effective | PowerShell launched from Word document |
I learned something critical about detection logic early in my career: simple rules, consistently enforced, beat complex AI 90% of the time.
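As an example of that principle, here's what the first row of the table, impossible travel, looks like as a simple rule. This is a hedged sketch: the login fields (`time`, `lat`, `lon`) and the 900 km/h cutoff are my assumptions, not a vendor rule.

```python
from math import radians, sin, cos, asin, sqrt

def km_between(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371.0 * 2 * asin(sqrt(a))

def impossible_travel(prev_login, new_login, max_kmh=900.0):
    """Alert when the implied speed between two logins exceeds roughly airliner speed."""
    distance = km_between(prev_login["lat"], prev_login["lon"],
                          new_login["lat"], new_login["lon"])
    hours = (new_login["time"] - prev_login["time"]).total_seconds() / 3600
    if hours <= 0:
        return distance > 50  # near-simultaneous logins from far-apart locations
    return distance / hours > max_kmh
```

A dozen lines of arithmetic, and it's essentially the logic that caught the stolen service account credentials in the story above.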
DE.AE-3: Event Data Collection and Correlation (Making the Pieces Connect)
Here's a hard truth from the trenches: most organizations collect way too much data and correlate far too little of it.
I once audited a company spending $180,000 annually on log storage. They had seven years of logs for compliance purposes. When I asked what they actually did with the logs, the answer was crickets.
"We search them when we need to," the IT manager said.
"How often do you need to?" I asked.
"Maybe three times last year."
They were spending $60,000 per search. That's not a detection program—that's expensive digital hoarding.
The Correlation Strategy That Actually Works
Here's how I build effective correlation programs:
1. Start With High-Value Correlations
Don't try to correlate everything. Start with the combinations that indicate actual compromise:
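Example correlation rules that catch real threats might look like the sketch below: deliberately SIEM-agnostic Python pseudologic. The event types, field names, and one-hour window are illustrative assumptions, not any vendor's rule syntax.

```python
from datetime import timedelta

def correlate(events, window=timedelta(hours=1)):
    """Yield alerts for two high-value pairings: a failed-auth burst followed by a successful
    privileged login, and a new admin account followed by a large outbound transfer."""
    events = sorted(events, key=lambda e: e["time"])
    for i, first in enumerate(events):
        for second in events[i + 1:]:
            if second["time"] - first["time"] > window:
                break  # events are sorted, so nothing later can fall inside the window
            if (first["type"] == "auth_failure_burst"
                    and second["type"] == "privileged_login_success"
                    and first.get("account") == second.get("account")):
                yield ("possible credential compromise", first, second)
            if (first["type"] == "admin_account_created"
                    and second["type"] == "large_outbound_transfer"
                    and first.get("host") == second.get("host")):
                yield ("possible persistence followed by exfiltration", first, second)
```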
2. Correlate Across Time Windows
One of my favorite detection wins involved a patient attacker. They were smart: they'd authenticate, wait 4 hours, then start their malicious activity. They knew most organizations only correlated events within 15-minute windows.
We caught them by extending our correlation window to 24 hours. The pattern became obvious: login, long pause, unusual activity. Every. Single. Time.
3. Context Is Everything
Raw correlation without context generates garbage alerts. Here's the data you need to make correlations meaningful:
Contextual Data | Why It Matters | Example Use |
|---|---|---|
User Role/Title | Different roles have different normal behaviors | CEO accessing HR system = normal; Intern accessing financial records = suspicious |
Asset Criticality | Not all systems are equal | Access to dev server vs production financial database |
Time of Day/Week | Temporal context changes risk | Weekend access by accounting staff vs weekday |
Geographic Location | Physical context matters | Office location vs foreign country |
Historical Behavior | Individual baseline | User who always works remotely vs new remote access |
Peer Behavior | Departmental context | What are similar users doing right now? |
DE.AE-4: Determining Impact (Why "Alert Fatigue" Kills Security Programs)
Let me tell you about the worst detection program I ever inherited.
In 2018, I started working with a company whose security team was drowning. They had implemented a new SIEM six months earlier and were getting 12,000 alerts per day. Per. Day.
The security analysts were broken. They'd come in every morning, see thousands of new alerts, and just start clicking "Resolved" without investigating. One analyst told me, "If I actually investigated every alert, I'd need 47 hours per day."
This is what happens when you don't properly determine impact.
The fix? We implemented a proper impact assessment framework (a small classification sketch follows the table):
Impact Level | Criteria | Response Time | Assignment | Example Scenarios |
|---|---|---|---|---|
CRITICAL | Production systems affected; Active data exfiltration; Ransomware detected | <15 minutes | Senior analyst + CISO notification | Database server sending 10GB to external IP |
HIGH | Privileged account compromise; Multiple systems affected; Confirmed malware | <1 hour | Senior analyst | Domain admin account authenticating from unusual location |
MEDIUM | Single system compromise; Suspicious but not confirmed; Policy violations | <4 hours | Standard analyst | Failed login attempts exceeding threshold |
LOW | Potential false positive; Informational; Minor policy deviation | <24 hours | Automated or junior analyst | Single failed authentication |
INFORMATIONAL | Baseline violations; Behavioral anomalies; Audit triggers | No SLA | Logged for analysis | User accessing system at unusual (but not impossible) time |
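The framework in the table reduces to a small classification routine. A hedged sketch: the boolean flags and SLA minutes mirror the table above, but the real criteria would come from your own asset inventory and incident taxonomy.

```python
def classify_alert(alert: dict):
    """Map alert attributes to (severity, response SLA in minutes, assignment), per the table above."""
    if alert.get("ransomware") or alert.get("active_exfiltration") or alert.get("production_impact"):
        return ("CRITICAL", 15, "senior analyst + CISO notification")
    if (alert.get("privileged_account_compromise") or alert.get("confirmed_malware")
            or alert.get("systems_affected", 0) > 1):
        return ("HIGH", 60, "senior analyst")
    if alert.get("single_system_compromise") or alert.get("policy_violation"):
        return ("MEDIUM", 240, "standard analyst")
    if alert.get("likely_false_positive") or alert.get("minor_deviation"):
        return ("LOW", 1440, "automated or junior analyst")
    return ("INFORMATIONAL", None, "logged for trend analysis")
```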
After implementing this framework, we went from 12,000 alerts per day to about 40 actionable alerts. The other 11,960 weren't deleted—they were properly categorized as informational and aggregated for trend analysis.
Three months later, we caught a major intrusion attempt. The alert was marked CRITICAL, the senior analyst responded in 8 minutes, and we contained the attack before any data left the network.
The analyst who'd been clicking "Resolved" on everything six months earlier? He personally thanked me. "I can actually do my job now," he said.
"An alert without impact context is just noise. Noise doesn't get investigated. And uninvestigated alerts are just permission slips for attackers."
DE.AE-5: Setting Alert Thresholds (The Goldilocks Problem)
Here's a question I get constantly: "How many failed login attempts before we alert?"
The answer? It depends.
Too low, and you'll drown in false positives. Too high, and you'll miss real attacks. This is the Goldilocks problem of detection: the threshold needs to be just right.
I learned this lesson painfully in my early career. I set failed authentication thresholds at 5 attempts because "that's the industry standard." Within a week, we were getting 800 alerts per day. Users with fat fingers, expired passwords, or caps lock mistakes were triggering alerts constantly.
We raised the threshold to 50 attempts. Two weeks later, a credential stuffing attack came through with 47 attempts per account. We missed it entirely.
My Framework for Setting Effective Thresholds
Here's what I do now:
Step 1: Understand Your Environment
Collect baseline data for 30 days minimum, then calculate the following (see the sketch after this list):
Mean (average) value
Median (middle) value
Standard deviation (variation)
95th percentile (captures most normal activity)
99th percentile (captures nearly all normal activity)
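Here's a small sketch of the Step 1 arithmetic, assuming you've exported roughly 30 days of per-day counts (failed logins, transfer volumes, whatever you're thresholding) into a plain list. The nearest-rank percentile is crude but perfectly adequate for threshold setting.

```python
from statistics import mean, median, pstdev

def baseline_stats(daily_values):
    """Compute the Step 1 numbers from ~30 days of observations."""
    ordered = sorted(daily_values)

    def pct(p):  # nearest-rank percentile
        idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]

    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "std_dev": pstdev(ordered),
        "p95": pct(95),
        "p99": pct(99),
    }

stats = baseline_stats([3, 4, 2, 6, 5, 3, 7, 4, 2, 48, 3, 5, 4, 3, 6])
auth_failure_threshold = stats["p95"] + 2 * stats["std_dev"]  # matches the first row of the table below
```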
Step 2: Set Initial Thresholds
Metric Type | Starting Threshold | Rationale |
|---|---|---|
Authentication Failures | 95th percentile + 2 standard deviations | Catches outliers while allowing normal variation |
Data Transfers | 99th percentile + 50% | Large transfers are less frequent; higher threshold needed |
Access Attempts | 95th percentile + 3 standard deviations | Balance between detection and false positives |
Failed Privileged Actions | Any occurrence | Privilege failures are always suspicious |
After-Hours Activity | 75th percentile (lower threshold) | Less activity = easier to spot anomalies |
Step 3: Tune Aggressively
For the first 30 days, review every alert and track:
True positives (real threats)
False positives (benign activity)
False negatives (threats you missed)
Adjust thresholds weekly based on this data.
Step 4: Implement Dynamic Thresholds
Static thresholds fail. I learned this when a client's business volume increased 300% over six months. All our carefully tuned thresholds became useless.
Now I implement dynamic thresholds that adjust based on the factors below (see the sketch after this list):
Time of day (business hours vs after hours)
Day of week (weekday vs weekend)
Season (retail during holidays, universities during enrollment)
Known events (system maintenance, business travel, conferences)
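In practice, "dynamic" just means the threshold is looked up from context instead of hard-coded. A minimal sketch; the bucket values are placeholders you'd derive from your own baselines, and the maintenance-window flag stands in for a real change calendar.

```python
from datetime import datetime

# Hypothetical failed-authentication thresholds, derived from per-context baselines.
THRESHOLDS = {
    ("weekday", "business_hours"): 15,
    ("weekday", "after_hours"): 8,
    ("weekend", "any"): 5,
}

def threshold_for(now: datetime, maintenance_window: bool = False) -> int:
    """Select a failed-authentication threshold based on day type and time of day."""
    if maintenance_window:
        return 40  # known noisy periods get extra headroom
    if now.weekday() >= 5:  # Saturday or Sunday
        return THRESHOLDS[("weekend", "any")]
    period = "business_hours" if 8 <= now.hour < 18 else "after_hours"
    return THRESHOLDS[("weekday", period)]
```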
Real-World Threshold Example
Let me show you a threshold tuning case study from 2022:
Initial Situation:
Failed authentication threshold: 10 attempts
Alerts per day: 340
True positives: 2-3 per month
False positive rate: 99.7%
After Analysis:
Normal user failed attempts: 0-3 per day (98% of users)
Users with persistent issues: 4-8 per day (1.8% of users)
Actual attacks: 15+ attempts within 5 minutes
New Threshold:
15 failed attempts within a 5-minute window
OR 25 failed attempts in 24 hours
AND not from known problematic accounts
Results:
Alerts per day: 12
True positives: 8-10 per month
False positive rate: 3%
We went from investigating 340 mostly useless alerts per day to 12 highly accurate ones. The security team could actually investigate every alert thoroughly.
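Expressed as code, the tuned rule from this case study looks roughly like the sketch below. The numbers are the ones above; the event shape and the known-problem-accounts exclusion list are assumptions about how you'd feed it data.

```python
from datetime import timedelta

def should_alert(failure_times, account, known_problem_accounts):
    """failure_times: sorted datetimes of failed logins for one account over the last 24 hours."""
    if account in known_problem_accounts:
        return False  # accounts with documented, benign login issues are suppressed
    if len(failure_times) >= 25:  # 25 failed attempts in 24 hours
        return True
    for i, start in enumerate(failure_times):
        in_window = [t for t in failure_times[i:] if t - start <= timedelta(minutes=5)]
        if len(in_window) >= 15:  # 15 failed attempts within any 5-minute window
            return True
    return False
```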
Security Continuous Monitoring (DE.CM): The Always-On Security Guard
Most organizations think of monitoring as "collect logs and search them when something goes wrong." That's not monitoring—that's forensics with extra steps.
Real continuous monitoring is active, real-time observation with immediate alerting.
The DE.CM Categories That Provide Real Visibility
Sub-Category | Monitoring Focus | Why It Matters | Key Technologies |
|---|---|---|---|
DE.CM-1 | Network monitoring | First line of defense; catches lateral movement | Network TAPs, NetFlow, SPAN ports |
DE.CM-2 | Physical environment monitoring | Physical access often precedes logical breach | Cameras, badge readers, environmental sensors |
DE.CM-3 | Personnel activity monitoring | Insider threats and compromised accounts | User activity monitoring, DLP, CASB |
DE.CM-4 | Malicious code detection | Known threats identification | Antivirus, EDR, sandbox analysis |
DE.CM-5 | Unauthorized devices/software | Shadow IT and supply chain attacks | Network access control, asset inventory |
DE.CM-6 | External service provider monitoring | Third-party compromise detection | Vendor security assessments, monitoring |
DE.CM-7 | Unauthorized personnel, connections, devices | Perimeter breach detection | Network admission control, IDS/IPS |
DE.CM-8 | Vulnerability scans | Proactive weakness identification | Vulnerability scanners, patch management |
DE.CM-1: Network Monitoring That Actually Works
In 2021, I implemented network monitoring for a manufacturing company. They had some basic firewalls and called it good.
During the first week of proper network monitoring, we discovered:
A cryptocurrency miner running on 40% of their factory floor computers
An engineering workstation sending data to an IP address in Belarus
An unauthorized VPN server on their network
Three unpatched Windows 2003 servers still running (in 2021!)
None of these showed up in their previous "monitoring" because they weren't actually watching network traffic—they were just logging firewall permits and denies.
Effective Network Monitoring Strategy
Here's what actually works:
Layer 1: NetFlow Analysis
Monitor traffic patterns, not packet contents
Identify communication anomalies
Detect data exfiltration by volume
Low overhead, high visibility
Layer 2: Full Packet Capture (Strategic)
Critical network segments only (database DMZ, executive network)
Deep inspection for threats
Forensic evidence collection
High storage requirements
Layer 3: IDS/IPS
Signature-based threat detection
Known attack pattern identification
Automatic blocking (IPS) of confirmed threats
Regular signature updates critical
Example Network Monitoring Detection:
ALERT: Unusual DNS Query Pattern
- Workstation: EXEC-LAPTOP-042
- Queries: 847 unique DNS requests in 10 minutes
- Pattern: Random subdomain queries to same domain
- Assessment: DNS tunneling for command and control
- Action: Immediate network isolation

DE.CM-3: Personnel Activity Monitoring (The Insider Threat Detector)
Here's something that keeps CISOs up at night: 62% of data breaches involve insider threats or stolen credentials (Verizon DBIR 2024).
I witnessed this firsthand in 2020. A healthcare organization noticed unusual activity from a nurse's account—accessing patient records she had no clinical reason to view. We investigated.
Turned out she was selling celebrity patient information to tabloids. She'd been doing it for 14 months before monitoring caught her. The HIPAA fines alone exceeded $1.2 million.
The sad part? Simple monitoring would have caught her in week one. She was accessing 50-60 patient records per shift with no corresponding care activities.
User Activity Monitoring That Respects Privacy AND Catches Threats
This is delicate territory. Monitor too much, and you create a dystopian workplace. Monitor too little, and you miss insider threats.
Here's my balanced approach:
Monitor This | Don't Monitor This | Why the Distinction Matters |
|---|---|---|
✅ Access to sensitive data | ❌ Personal email content | Privacy vs security balance |
✅ Administrative actions | ❌ Websites visited (unless malicious) | Job function vs personal activity |
✅ After-hours activity | ❌ Keystroke logging | Red flags vs invasive surveillance |
✅ Large data transfers | ❌ Personal file contents | Risk-based vs intrusive |
✅ Privilege escalation | ❌ Personal conversations | Security events vs privacy violation |
✅ Policy violations | ❌ Break time activities | Relevant vs irrelevant |
Focus on WHAT users access, not WHY they're accessing it (until an alert triggers investigation).
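For the patient-records scenario above, the monitoring logic can stay this simple: count sensitive records touched per user per shift and compare against role peers. A hedged sketch with made-up field names and a made-up deviation multiplier.

```python
from collections import defaultdict
from statistics import mean, pstdev

def flag_excessive_access(access_log, role_of, multiplier=3.0):
    """access_log: (user, record_id) tuples for one shift; role_of: user -> role.
    Flags users whose distinct-record count is far above their role's average."""
    per_user = defaultdict(set)
    for user, record_id in access_log:
        per_user[user].add(record_id)
    counts_by_role = defaultdict(list)
    for user, records in per_user.items():
        counts_by_role[role_of[user]].append(len(records))
    flagged = []
    for user, records in per_user.items():
        peers = counts_by_role[role_of[user]]
        spread = pstdev(peers) if len(peers) > 1 else 0.0
        if len(records) > mean(peers) + multiplier * max(spread, 1.0):
            flagged.append((user, len(records)))
    return flagged
```

A nurse pulling 50-60 records per shift while her peers pull a handful trips this rule on day one, not in month fourteen.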
DE.CM-4: Malicious Code Detection (Beyond Basic Antivirus)
"We have antivirus" is the security equivalent of "we have Band-Aids" in medicine. Great! But what about surgery?
Traditional antivirus catches maybe 40-50% of modern malware. I've seen ransomware waltz right past fully updated antivirus solutions because it was too new, too customized, or too clever.
Modern malicious code detection requires multiple layers:
Detection Method | What It Catches | What It Misses | Best Use Case |
|---|---|---|---|
Signature-Based AV | Known malware variants | Zero-day threats, polymorphic malware | Commodity malware, known threats |
Behavioral Analysis | Unknown malware acting suspiciously | Sophisticated attacks mimicking normal behavior | Ransomware, new attack techniques |
Sandboxing | Malware that needs to execute to reveal itself | Time-delayed malware, environment-aware attacks | Email attachments, downloaded files |
Machine Learning | Patterns indicating malicious intent | Completely novel attack methods | Large-scale threat hunting |
Memory Analysis | Fileless malware, in-memory exploits | Persistent threats in files | Advanced persistent threats |
Real Detection Example: Layered Defense in Action
Let me share a perfect example of why you need multiple detection layers.
In 2022, a financial services client got hit with a targeted spear phishing attack. The malware was custom-built for them. Here's how our layered detection responded:
Layer 1 - Email Gateway: ❌ MISSED
Malicious attachment had valid signature
Sender email looked legitimate
No known threat signatures
Layer 2 - Endpoint AV: ❌ MISSED
Zero-day malware, no signature
File appeared benign
Layer 3 - Sandbox Analysis: ⚠️ SUSPICIOUS
File exhibited some unusual behavior
Not definitive enough to block
Flagged for monitoring
Layer 4 - EDR (Endpoint Detection & Response): ✅ DETECTED
Process attempted to disable logging
Created persistence mechanism
Attempted network beacon to unknown domain
ALERT TRIGGERED
Response Time: 4 minutes from execution to containment
Single layer? Compromised. Multiple layers? Contained.
"Modern malware is like a burglar checking for different locks on your door. If you only have one lock, and they have that key, you're toast. Multiple detection layers mean multiple chances to catch them."
DE.CM-8: Vulnerability Scanning (Finding Problems Before Attackers Do)
Here's a harsh reality: the average organization has 57 critical vulnerabilities at any given time (Qualys Research 2024).
Want to know what's worse? Most organizations discover these vulnerabilities AFTER attackers exploit them.
I worked with a company in 2019 that learned this lesson expensively. They'd been breached through EternalBlue, the exploit behind WannaCry. In 2019. Two years after the patch was released.
"We didn't know we had vulnerable systems," the IT manager said.
"Did you scan for them?" I asked.
Silence.
They paid $890,000 in ransomware, response costs, and recovery. A vulnerability scanner costs about $10,000 annually.
That's an 8,900% markup for ignorance.
Vulnerability Scanning Strategy That Works
Here's my standard implementation:
Weekly: Authenticated Scans
Full network scan with credentials
Identifies missing patches
Discovers misconfigurations
Maps software inventory
Monthly: Unauthenticated Scans
External perspective (what attackers see)
Validates patch effectiveness
Identifies perimeter weaknesses
Tests external defenses
Quarterly: Comprehensive Assessments
Web application scanning
Database vulnerability assessment
IoT and operational technology scanning
Cloud infrastructure review
Continuous: Passive Monitoring
Network traffic analysis
Asset discovery
Change detection
Drift identification
Scan Type | Frequency | Focus | Typical Findings |
|---|---|---|---|
Internal Authenticated | Weekly | Missing patches, misconfigurations | 200-500 findings in typical network |
External Unauthenticated | Monthly | Internet-facing vulnerabilities | 20-50 critical findings |
Web Application | Monthly | OWASP Top 10, injection flaws | 30-100 findings per application |
Database | Quarterly | Default passwords, excessive permissions | 40-80 findings per database |
Cloud Configuration | Weekly | Misconfigured services, exposed data | 10-30 findings in typical cloud environment |
Detection Processes (DE.DP): Making Detection Sustainable
Having great detection technology is like owning a Ferrari—useless if nobody knows how to drive it.
I've seen organizations spend $500,000 on detection tools and $50,000 on the people and processes to use them. Six months later, the tools are shelfware and they're back to reactive firefighting.
The Five DE.DP Sub-Categories That Make or Break Programs
Sub-Category | Focus | Common Failure | Success Factor |
|---|---|---|---|
DE.DP-1 | Detection roles and responsibilities | Nobody owns detection | Clear ownership with authority |
DE.DP-2 | Detection activities comply with requirements | Checkbox compliance | Understanding WHY requirements exist |
DE.DP-3 | Detection process testing | Set it and forget it | Regular testing and adjustment |
DE.DP-4 | Event detection communication | Alerts die in queues | Clear escalation paths |
DE.DP-5 | Detection process improvement | Same mistakes repeated | Systematic learning from incidents |
DE.DP-1: Detection Roles (Who's Actually Watching?)
In 2020, I conducted a tabletop exercise for a retail company. I simulated a ransomware attack and asked: "Who's responsible for detecting this?"
Five different people thought they were. None of them actually were.
The IT manager thought the security team handled it. The security team thought the SOC handled it. The SOC thought the MSSP handled it. The MSSP thought they were only responsible for network monitoring. And the CISO thought everyone was handling their part.
This is shockingly common.
The Detection RACI Matrix That Actually Works
I implement a RACI model (Responsible, Accountable, Consulted, Informed) for every detection activity:
Example: Ransomware Detection
Activity | Responsible | Accountable | Consulted | Informed |
|---|---|---|---|---|
Monitor for indicators | SOC Analyst | SOC Manager | Threat Intel Team | CISO |
Investigate alerts | L2 Analyst | SOC Manager | IT Operations | Security Leadership |
Escalate incidents | SOC Manager | CISO | Legal, PR | Executive Team |
Coordinate response | Incident Manager | CISO | All stakeholders | Board |
Post-incident review | Security Team | CISO | All participants | Everyone |
Notice how EVERY activity has ONE accountable person. That's critical. Shared accountability is no accountability.
DE.DP-3: Testing Detection (Trust But Verify)
Here's an uncomfortable question I ask every client: "When's the last time you tested whether your detection actually works?"
The most common answer? "Uhh..."
In 2021, I worked with a healthcare organization that had invested heavily in EDR (Endpoint Detection and Response). They were confident in their detection capabilities. I asked if I could test them.
We simulated a ransomware attack in a controlled test environment. Their EDR missed it completely. The ransomware encrypted 2,000 test files before anyone noticed.
The CISO was devastated. "We spent $300,000 on this solution!"
The problem wasn't the technology—it was the configuration and tuning. Nobody had actually tested it against realistic attack scenarios.
My Detection Testing Framework
Monthly: Synthetic Attacks
Simulate common attack techniques
Test detection and alerting
Measure response time
Validate escalation procedures
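To make that monthly synthetic testing measurable, here's a minimal harness sketch. `inject_test_event` and `alert_fired` are placeholders for whatever your SIEM or EDR actually exposes; the point is recording detection latency, not the specific API.

```python
import time
import uuid

def run_detection_test(inject_test_event, alert_fired, timeout_seconds=900):
    """Inject a uniquely tagged, benign test event, then poll until an alert referencing
    that tag appears or the timeout expires. Returns detection latency in seconds, or None."""
    marker = f"detection-test-{uuid.uuid4()}"
    started = time.monotonic()
    inject_test_event(marker)      # e.g., a harmless tagged log entry or test file drop
    while time.monotonic() - started < timeout_seconds:
        if alert_fired(marker):    # did the pipeline raise an alert containing our marker?
            return time.monotonic() - started
        time.sleep(10)
    return None                    # a missed detection is a finding in itself
```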
Quarterly: Red Team Exercises
Professional attackers test your defenses
Realistic attack scenarios
Identifies gaps in detection coverage
Tests entire response chain
Annual: Purple Team Exercises
Red team attacks, blue team defends, both collaborate
Improves both detection and response
Shares knowledge across teams
Builds organizational capability
Continuous: Alert Validation
Every alert should be reviewed
Track true vs false positives
Identify gaps in detection
Tune rules based on feedback
DE.DP-4: Event Detection Communication (Getting the Right Information to the Right People)
Communication failures kill incident response.
I watched a breach unfold in 2019 where the SOC detected the attack at 10:47 PM. They created a ticket in the system and went home at 11 PM (end of shift).
The ticket sat in a queue until 8:30 AM the next morning.
By then, the attackers had encrypted 40% of the company's file servers.
The SOC did their job—they detected and documented. But nobody told anyone who could actually DO anything about it.
Communication Protocols That Work
Here's my standard communication matrix:
Severity | Initial Notification | Time Frame | Method | Escalation |
|---|---|---|---|---|
CRITICAL | SOC → Security Manager → CISO | Immediate | Phone call + SMS | Auto-escalate in 15 min if no response |
HIGH | SOC → Security Manager | < 30 minutes | Phone call | Escalate to CISO in 1 hour |
MEDIUM | SOC → Security Team | < 2 hours | Ticket + Email | Escalate if no acknowledgment in 4 hours |
LOW | Ticket system | < 8 hours | Ticket | Standard queue |
INFORMATIONAL | Daily digest | Next business day | Email report | None |
Critical rule: If it's important enough to alert on, it's important enough to ensure someone sees it immediately.
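The auto-escalation rule in the table is easy to encode. A sketch under obvious assumptions: `send_page` and `acknowledged` stand in for your paging or ticketing integration, and the contact chain is illustrative.

```python
import time

ESCALATION_CHAIN = {
    "CRITICAL": ["security_manager", "ciso", "executive_on_call"],
    "HIGH": ["security_manager", "ciso"],
}

def notify_with_escalation(severity, send_page, acknowledged, wait_seconds=900):
    """Page each contact in order; stop as soon as someone acknowledges within the wait window."""
    for contact in ESCALATION_CHAIN.get(severity, []):
        send_page(contact)
        deadline = time.monotonic() + wait_seconds
        while time.monotonic() < deadline:
            if acknowledged():
                return contact
            time.sleep(30)
    return None  # nobody acknowledged: trigger the out-of-band procedure
```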
DE.DP-5: Continuous Improvement (Learning From Every Detection)
Every detection—whether true positive or false positive—is a learning opportunity.
I implemented a post-detection review process for a financial services company in 2020. After every alert investigation (not just incidents), analysts documented:
What triggered the alert?
Was it a true or false positive?
How long did investigation take?
What could improve detection?
What could improve response?
Six months of this data revealed something fascinating:
Finding | Impact | Action Taken |
|---|---|---|
40% of alerts were duplicate notifications from multiple sources | Wasted 160 analyst hours/month | Consolidated alerting, saved $38k/month |
3 types of false positives accounted for 60% of false alerts | Analyst burnout, missed real threats | Tuned 3 rules, FP rate dropped 60% |
80% of critical alerts occurred during shift changes | Delayed response by 15-45 minutes | Implemented shift overlap, response time improved 72% |
Analysts spent 30% of time gathering context | Slow investigations | Automated context enrichment, investigation time cut 35% |
The ROI on this improvement process? We calculated over $450,000 in annual savings from efficiency gains alone. The improved threat detection? Priceless.
Building Your Detection Program: A Practical Roadmap
Alright, enough theory. Let me give you the exact roadmap I use to build detection programs:
Phase 1: Foundation (Months 1-3)
Week 1-2: Asset Inventory
What systems do you have?
What data do they contain?
What's their criticality?
Week 3-4: Quick Wins
Deploy basic endpoint protection
Enable logging on critical systems
Implement failed authentication monitoring
Set up basic network monitoring
Weeks 5-8: Initial Baselines
Collect 30 days of normal activity data
Establish preliminary thresholds
Document known anomalies
Train team on new tools
Weeks 9-12: Detection Use Cases
Implement top 10 critical detections
Configure initial alerting
Establish on-call procedures
Begin incident response documentation
Phase 2: Enhancement (Months 4-6)
Month 4: Correlation and Context
Implement SIEM or log correlation
Build correlation rules
Add context enrichment
Tune initial detection rules
Month 5: Advanced Detection
Add behavioral analytics
Implement user activity monitoring
Deploy additional sensors
Expand detection coverage
Month 6: Process Refinement
Document all detection procedures
Conduct first purple team exercise
Review and optimize alert workflows
Implement continuous improvement process
Phase 3: Maturity (Months 7-12)
Month 7-8: Automation
Automate routine investigations
Implement automated response for known threats
Build detection playbooks
Create automated reporting
Month 9-10: Testing and Validation
Regular red team exercises
Monthly detection testing
Quarterly comprehensive assessments
Annual program review
Month 11-12: Optimization
Advanced threat hunting
Machine learning integration
Third-party integration
Continuous tuning and improvement
The Metrics That Actually Matter
Let me share the dashboard I use to track detection program effectiveness (the arithmetic behind the key rows is sketched after the table):
Metric | Target | Why It Matters | How to Measure |
|---|---|---|---|
Mean Time to Detect (MTTD) | < 24 hours | Industry average is 207 days | Time from compromise to detection |
Mean Time to Investigate (MTTI) | < 2 hours | Speed of investigation matters | Time from alert to initial assessment |
Mean Time to Contain (MTTC) | < 4 hours | Limit attacker dwell time | Time from detection to containment |
False Positive Rate | < 5% | Analyst efficiency and effectiveness | FP alerts / total alerts |
Detection Coverage | > 90% | How much of environment is monitored | Monitored assets / total assets |
Alert Tuning Efficiency | < 2% recurring FPs | Quality of detection rules | Repeated FP patterns |
Critical System Visibility | 100% | No blind spots in critical areas | Critical systems monitored |
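Most of those rows reduce to simple arithmetic over your incident and alert records. A hedged sketch, assuming each incident record carries compromise, detection, and containment timestamps.

```python
from statistics import mean

def detection_metrics(incidents, alerts):
    """incidents: dicts with 'compromised_at', 'detected_at', 'contained_at' datetimes.
    alerts: dicts with a boolean 'false_positive' field."""
    def hours(a, b):
        return (b - a).total_seconds() / 3600

    return {
        "mttd_hours": mean(hours(i["compromised_at"], i["detected_at"]) for i in incidents),
        "mttc_hours": mean(hours(i["detected_at"], i["contained_at"]) for i in incidents),
        "false_positive_rate": sum(a["false_positive"] for a in alerts) / len(alerts),
    }
```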
Common Detection Mistakes (And How to Avoid Them)
After 15 years, I've seen these mistakes over and over:
Mistake #1: Collecting Without Analyzing
The Problem: Organizations collect every log from every system and never look at them.
The Fix: Start small. Monitor what you can actually analyze. Add sources as you build capability.
Mistake #2: Alerting Without Response
The Problem: Alerts trigger but nobody responds or they overwhelm the team.
The Fix: Every alert needs an owner and a process. No exceptions.
Mistake #3: Static Thresholds
The Problem: Set thresholds once and never adjust them as business changes.
The Fix: Review thresholds quarterly. Implement dynamic thresholds where possible.
Mistake #4: Tool-First Approach
The Problem: Buy expensive tools without understanding what you need to detect.
The Fix: Define detection requirements first. Then select tools that meet those requirements.
Mistake #5: No Testing
The Problem: Assume detection works without validation.
The Fix: Test regularly. Red team quarterly. Validate after every configuration change.
Your Next Steps
If you're building or improving a detection program, here's what I recommend:
This Week:
Inventory your current detection capabilities
Identify your three biggest blind spots
Document who's responsible for detection activities
Review your most recent security alerts
This Month:
Establish baselines for critical systems
Implement your first correlation rule
Test one detection use case
Document your detection procedures
This Quarter:
Deploy comprehensive monitoring on critical assets
Build out your top 10 detection use cases
Conduct first detection testing exercise
Implement a continuous improvement process
This Year:
Achieve 90% detection coverage
Reduce MTTD to under 24 hours
Build automated response for common threats
Establish mature detection operations
The Bottom Line: Detection Is Not Optional
Here's what fifteen years in cybersecurity has taught me: you're going to get attacked. It's not if, it's when.
The question isn't whether threats will target you. The question is whether you'll know about it when they do.
I've seen organizations survive devastating attacks because they had solid detection. I've watched others crumble under breaches that went undetected for months.
The difference? The NIST Detect function, properly implemented.
Don't be the organization that discovers a breach from the FBI. Don't be the company that reads about their own breach in the news. Don't be the CISO trying to explain to the board how attackers were in your network for 11 months without anyone noticing.
Build detection. Test detection. Trust but verify detection.
Because in cybersecurity, what you don't know absolutely can hurt you.
And what you detect early, you can stop before it becomes catastrophic.
"The best security programs don't prevent every attack. They detect every attack that matters and respond before it becomes a crisis."