I remember the exact moment I learned the hard way about the importance of detection capabilities. It was 2017, and I was three months into a consulting engagement with a pharmaceutical company. During a routine review, we discovered evidence of unauthorized access that had been happening for eleven months. Eleven months! The attackers had been exfiltrating research data, and nobody knew because, quite simply, nobody was looking.
The CISO went pale. "But we have a firewall," he said. "And antivirus. How did this happen?"
"You had walls," I told him, "but no security guards watching them."
That conversation changed how I approach security architecture forever. After fifteen years in this field, I've learned that prevention without detection is just wishful thinking. The NIST Cybersecurity Framework's Detect function isn't just one of five core functions—it's often the difference between a contained incident and a catastrophic breach.
Understanding the NIST CSF Detect Function: More Than Just Monitoring
Let me be blunt: most organizations are terrible at detection. They spend 80% of their security budget on prevention and maybe 10% on detection. Then they wonder why breaches go undetected for an average of 207 days (according to the 2024 IBM Cost of a Data Breach Report).
The NIST Cybersecurity Framework Detect function addresses this critical gap. It's built on a simple premise: you can't stop every attack, but you can detect and respond to them before they cause catastrophic damage.
"Prevention is ideal, but detection is essential. You can survive a detected breach. You might not survive an undetected one."
The Three Detect Categories That Matter
The NIST CSF breaks the Detect function into three main categories. I've implemented each of these dozens of times, and here's what I've learned:
NIST Category | What It Means | Why It Matters | Real Impact |
|---|---|---|---|
Anomalies and Events (DE.AE) | Detecting unusual activity and potential security incidents | Finds threats that bypass preventive controls | Average detection time: 24 hours vs 207 days |
Security Continuous Monitoring (DE.CM) | Ongoing observation of networks, systems, and data | Provides real-time visibility into security posture | 73% faster incident response |
Detection Processes (DE.DP) | Procedures and roles for detection activities | Ensures detection happens consistently | 89% reduction in false positives |
Anomalies and Events (DE.AE): Teaching Systems to Notice What's Wrong
In 2019, I worked with a financial services company that was convinced they had solid detection capabilities. They had a SIEM (Security Information and Event Management system) that collected logs from everything. Millions of events per day.
The problem? Nobody was actually analyzing them. The SIEM had become a very expensive log storage system.
During my assessment, I asked their security analyst to show me alerts from the past week. He pulled up a dashboard showing 14,872 alerts. I asked him how many he'd investigated.
"Honestly?" he said. "Maybe twenty. The rest are probably false positives."
Probably.
This is the challenge with anomaly detection: it's not about collecting data—it's about understanding what matters.
The Five Sub-Categories of Anomaly Detection That Actually Work
Here's how I implement DE.AE across organizations, based on what actually produces results:
Sub-Category | Focus Area | Implementation Priority | Common Pitfall |
|---|---|---|---|
DE.AE-1 | Establish baseline of network operations | HIGH - Foundation for everything else | Baselines go stale; update quarterly |
DE.AE-2 | Detect potentially malicious events | HIGH - Core detection capability | Too many false positives overwhelm teams |
DE.AE-3 | Collect and correlate event data | CRITICAL - Can't detect without data | Collect everything, analyze nothing |
DE.AE-4 | Determine impact of detected events | MEDIUM - Risk-based prioritization | Treat all alerts equally (wrong!) |
DE.AE-5 | Define alert thresholds | CRITICAL - Signal vs noise | Set once, never adjust (disaster) |
DE.AE-1: Establishing Behavioral Baselines (Or: Learning What Normal Looks Like)
Here's something nobody tells you: you can't detect anomalies until you know what normal looks like.
I worked with a healthcare provider in 2021 that kept getting alerts about "unusual database access." Every. Single. Day. Hundreds of alerts. The security team had become numb to them.
When we dug in, we discovered that their baseline was established during a holiday weekend when almost nobody was working. So "normal" meant 5% of actual normal activity. Everything else looked anomalous.
We spent two weeks establishing proper baselines:
Network traffic patterns during business hours vs off-hours
Typical data access patterns for different user roles
Standard authentication patterns (failed attempts, location, timing)
Normal system behavior (CPU, memory, disk usage)
Typical user behavior (applications accessed, data volumes, work patterns)
The impact was immediate. Alert volume dropped 87%. But here's the kicker: we actually detected MORE real threats because analysts could finally focus on genuine anomalies.
"A baseline built on a quiet weekend is like taking someone's temperature while they're sleeping and declaring them hypothermic when they wake up and start moving around."
Practical Baseline Implementation: What I Do Every Time
Here's my standard approach for establishing meaningful baselines (a short computation sketch follows the checklist):
Week 1-2: Data Collection
Collect data covering complete business cycles
Include both busy and slow periods
Capture seasonal variations if possible
Document any known anomalous events during collection
Week 3-4: Analysis and Refinement
Identify patterns and outliers
Segment by time of day, day of week, business unit
Account for legitimate variations
Remove actual incidents from baseline data
Week 5-6: Validation and Tuning
Test baselines against known good and bad activity
Adjust thresholds to minimize false positives
Document exceptions and edge cases
Train team on what baselines mean
Ongoing: Maintenance
Review baselines quarterly (minimum)
Update after major business changes
Track baseline drift over time
Document all baseline adjustments
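To make the Week 3-4 segmentation concrete, here's a minimal sketch of a per-weekday, per-hour baseline built from raw event timestamps. It assumes you can export event times from your logging platform as ISO 8601 strings; the three-sigma cutoff and the field handling are illustrative, not tied to any particular product.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

def build_baseline(event_timestamps):
    """Bucket event counts by (weekday, hour) and compute mean/std per bucket."""
    counts = defaultdict(lambda: defaultdict(int))  # (weekday, hour) -> {date: count}
    for ts in event_timestamps:
        dt = datetime.fromisoformat(ts)
        counts[(dt.weekday(), dt.hour)][dt.date()] += 1
    return {
        bucket: (mean(per_day.values()), pstdev(per_day.values()))
        for bucket, per_day in counts.items()
    }

def is_anomalous(baseline, when, observed_count, sigmas=3.0):
    """Flag a count that exceeds its bucket's mean by more than `sigmas` standard deviations."""
    avg, std = baseline.get((when.weekday(), when.hour), (0.0, 0.0))
    return observed_count > avg + sigmas * max(std, 1.0)
```

The point isn't the math; it's that "normal" is computed per time slice, so Tuesday at 10 AM is never compared against Sunday at 3 AM.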
DE.AE-2: Detecting Potentially Malicious Events (The Art of Seeing Threats)
Let me share a detection win that still makes me smile.
In 2020, I implemented detection controls for a software company. Three weeks after going live, our SIEM flagged something subtle: a service account authenticating from two different countries within 45 seconds. Individually, neither authentication was suspicious. Together? Impossible without credential theft.
We investigated immediately. Turned out a developer's laptop had been compromised, and an attacker had extracted service account credentials. The attacker was in Singapore; the legitimate automated process was in AWS us-east-1. The near-simultaneous logins from different geolocations triggered our correlation rules.
We contained the breach within 3 hours. The attacker had accessed exactly one internal system before we cut them off. Total damage: minimal. Total cost: about $15,000 in incident response.
Compare that to the $4.88 million average breach cost. That's a return of more than 300 to 1, roughly a 32,000% ROI.
Not bad for a "potentially malicious event" detection.
The Detection Use Cases That Actually Catch Threats
Based on my experience implementing detection programs, these are the use cases that consistently identify real threats:
Detection Category | What to Monitor | Why Attackers Can't Hide It | Example Alert Logic |
|---|---|---|---|
Impossible Travel | User authentication from different locations | Physical laws of geography | Login from NYC, then London 30 minutes later |
Privilege Escalation | Changes to user permissions | Need elevated access to accomplish goals | Standard user account granted admin rights |
After-Hours Access | Activity during unusual times | Off-hours = less detection risk (they think) | Database access at 3 AM by user who works 9-5 |
Data Exfiltration | Large outbound data transfers | Need to steal data to monetize attack | 50GB uploaded to unknown cloud storage |
Lateral Movement | System-to-system access patterns | Need to explore network to find valuable data | Web server initiating SMB connections to databases |
Failed Authentication Spikes | Multiple failed login attempts | Credential stuffing and brute force attacks | 500 failed logins in 10 minutes |
New Admin Accounts | Creation of privileged accounts | Persistence mechanism for long-term access | New domain admin created at 2 AM |
Process Anomalies | Unexpected process execution | Malware needs to run to be effective | PowerShell launched from Word document |
I learned something critical about detection logic early in my career: simple rules, consistently enforced, beat complex AI 90% of the time.
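As an example of that principle, here's what the first row of the table, impossible travel, looks like as a simple rule. This is a hedged sketch: the login fields (`time`, `lat`, `lon`) and the 900 km/h cutoff are my assumptions, not a vendor rule.

```python
from math import radians, sin, cos, asin, sqrt

def km_between(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371.0 * 2 * asin(sqrt(a))

def impossible_travel(prev_login, new_login, max_kmh=900.0):
    """Alert when the implied speed between two logins exceeds roughly airliner speed."""
    distance = km_between(prev_login["lat"], prev_login["lon"],
                          new_login["lat"], new_login["lon"])
    hours = (new_login["time"] - prev_login["time"]).total_seconds() / 3600
    if hours <= 0:
        return distance > 50  # near-simultaneous logins from far-apart locations
    return distance / hours > max_kmh
```

A dozen lines of arithmetic, and it's essentially the logic that caught the stolen service account credentials in the story above.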
DE.AE-3: Event Data Collection and Correlation (Making the Pieces Connect)
Here's a hard truth from the trenches: most organizations collect way too much data and correlate far too little of it.
I once audited a company spending $180,000 annually on log storage. They had seven years of logs for compliance purposes. When I asked what they actually did with the logs, the answer was crickets.
"We search them when we need to," the IT manager said.
"How often do you need to?" I asked.
"Maybe three times last year."
They were spending $60,000 per search. That's not a detection program—that's expensive digital hoarding.
The Correlation Strategy That Actually Works
Here's how I build effective correlation programs:
1. Start With High-Value Correlations
Don't try to correlate everything. Start with the combinations that indicate actual compromise:
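Example correlation rules that catch real threats might look like the sketch below: deliberately SIEM-agnostic Python pseudologic. The event types, field names, and one-hour window are illustrative assumptions, not any vendor's rule syntax.

```python
from datetime import timedelta

def correlate(events, window=timedelta(hours=1)):
    """Yield alerts for two high-value pairings: a failed-auth burst followed by a successful
    privileged login, and a new admin account followed by a large outbound transfer."""
    events = sorted(events, key=lambda e: e["time"])
    for i, first in enumerate(events):
        for second in events[i + 1:]:
            if second["time"] - first["time"] > window:
                break  # events are sorted, so nothing later can fall inside the window
            if (first["type"] == "auth_failure_burst"
                    and second["type"] == "privileged_login_success"
                    and first.get("account") == second.get("account")):
                yield ("possible credential compromise", first, second)
            if (first["type"] == "admin_account_created"
                    and second["type"] == "large_outbound_transfer"
                    and first.get("host") == second.get("host")):
                yield ("possible persistence followed by exfiltration", first, second)
```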
2. Correlate Across Time Windows
One of my favorite detection wins involved a patient attacker. They were smart: they'd authenticate, wait 4 hours, then start their malicious activity. They knew most organizations only correlated events within 15-minute windows.
We caught them by extending our correlation window to 24 hours. The pattern became obvious: login, long pause, unusual activity. Every. Single. Time.
3. Context Is Everything
Raw correlation without context generates garbage alerts. Here's the data you need to make correlations meaningful:
Contextual Data | Why It Matters | Example Use |
|---|---|---|
User Role/Title | Different roles have different normal behaviors | CEO accessing HR system = normal; Intern accessing financial records = suspicious |
Asset Criticality | Not all systems are equal | Access to dev server vs production financial database |
Time of Day/Week | Temporal context changes risk | Weekend access by accounting staff vs weekday |
Geographic Location | Physical context matters | Office location vs foreign country |
Historical Behavior | Individual baseline | User who always works remotely vs new remote access |
Peer Behavior | Departmental context | What are similar users doing right now? |
DE.AE-4: Determining Impact (Why "Alert Fatigue" Kills Security Programs)
Let me tell you about the worst detection program I ever inherited.
In 2018, I started working with a company whose security team was drowning. They had implemented a new SIEM six months earlier and were getting 12,000 alerts per day. Per. Day.
The security analysts were broken. They'd come in every morning, see thousands of new alerts, and just start clicking "Resolved" without investigating. One analyst told me, "If I actually investigated every alert, I'd need 47 hours per day."
This is what happens when you don't properly determine impact.
The fix? We implemented a proper impact assessment framework (a small classification sketch follows the table):
Impact Level | Criteria | Response Time | Assignment | Example Scenarios |
|---|---|---|---|---|
CRITICAL | Production systems affected; Active data exfiltration; Ransomware detected | <15 minutes | Senior analyst + CISO notification | Database server sending 10GB to external IP |
HIGH | Privileged account compromise; Multiple systems affected; Confirmed malware | <1 hour | Senior analyst | Domain admin account authenticating from unusual location |
MEDIUM | Single system compromise; Suspicious but not confirmed; Policy violations | <4 hours | Standard analyst | Failed login attempts exceeding threshold |
LOW | Potential false positive; Informational; Minor policy deviation | <24 hours | Automated or junior analyst | Single failed authentication |
INFORMATIONAL | Baseline violations; Behavioral anomalies; Audit triggers | No SLA | Logged for analysis | User accessing system at unusual (but not impossible) time |
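The framework in the table reduces to a small classification routine. A hedged sketch: the boolean flags and SLA minutes mirror the table above, but the real criteria would come from your own asset inventory and incident taxonomy.

```python
def classify_alert(alert: dict):
    """Map alert attributes to (severity, response SLA in minutes, assignment), per the table above."""
    if alert.get("ransomware") or alert.get("active_exfiltration") or alert.get("production_impact"):
        return ("CRITICAL", 15, "senior analyst + CISO notification")
    if (alert.get("privileged_account_compromise") or alert.get("confirmed_malware")
            or alert.get("systems_affected", 0) > 1):
        return ("HIGH", 60, "senior analyst")
    if alert.get("single_system_compromise") or alert.get("policy_violation"):
        return ("MEDIUM", 240, "standard analyst")
    if alert.get("likely_false_positive") or alert.get("minor_deviation"):
        return ("LOW", 1440, "automated or junior analyst")
    return ("INFORMATIONAL", None, "logged for trend analysis")
```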
After implementing this framework, we went from 12,000 alerts per day to about 40 actionable alerts. The other 11,960 weren't deleted—they were properly categorized as informational and aggregated for trend analysis.
Three months later, we caught a major intrusion attempt. The alert was marked CRITICAL, the senior analyst responded in 8 minutes, and we contained the attack before any data left the network.
The analyst who'd been clicking "Resolved" on everything six months earlier? He personally thanked me. "I can actually do my job now," he said.
"An alert without impact context is just noise. Noise doesn't get investigated. And uninvestigated alerts are just permission slips for attackers."
DE.AE-5: Setting Alert Thresholds (The Goldilocks Problem)
Here's a question I get constantly: "How many failed login attempts before we alert?"
The answer? It depends.
Too low, and you'll drown in false positives. Too high, and you'll miss real attacks. This is the Goldilocks problem of detection: the threshold needs to be just right.
I learned this lesson painfully in my early career. I set failed authentication thresholds at 5 attempts because "that's the industry standard." Within a week, we were getting 800 alerts per day. Users with fat fingers, expired passwords, or caps lock mistakes were triggering alerts constantly.
We raised the threshold to 50 attempts. Two weeks later, a credential stuffing attack came through with 47 attempts per account. We missed it entirely.
My Framework for Setting Effective Thresholds
Here's what I do now:
Step 1: Understand Your Environment
Collect baseline data for 30 days minimum, then calculate the following (see the sketch after this list):
Mean (average) value
Median (middle) value
Standard deviation (variation)
95th percentile (captures most normal activity)
99th percentile (captures nearly all normal activity)
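Here's a small sketch of the Step 1 arithmetic, assuming you've exported roughly 30 days of per-day counts (failed logins, transfer volumes, whatever you're thresholding) into a plain list. The nearest-rank percentile is crude but perfectly adequate for threshold setting.

```python
from statistics import mean, median, pstdev

def baseline_stats(daily_values):
    """Compute the Step 1 numbers from ~30 days of observations."""
    ordered = sorted(daily_values)

    def pct(p):  # nearest-rank percentile
        idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]

    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "std_dev": pstdev(ordered),
        "p95": pct(95),
        "p99": pct(99),
    }

stats = baseline_stats([3, 4, 2, 6, 5, 3, 7, 4, 2, 48, 3, 5, 4, 3, 6])
auth_failure_threshold = stats["p95"] + 2 * stats["std_dev"]  # matches the first row of the table below
```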
Step 2: Set Initial Thresholds
Metric Type | Starting Threshold | Rationale |
|---|---|---|
Authentication Failures | 95th percentile + 2 standard deviations | Catches outliers while allowing normal variation |
Data Transfers | 99th percentile + 50% | Large transfers are less frequent; higher threshold needed |
Access Attempts | 95th percentile + 3 standard deviations | Balance between detection and false positives |
Failed Privileged Actions | Any occurrence | Privilege failures are always suspicious |
After-Hours Activity | 75th percentile (lower threshold) | Less activity = easier to spot anomalies |
Step 3: Tune Aggressively
For the first 30 days, review every alert and track:
True positives (real threats)
False positives (benign activity)
False negatives (threats you missed)
Adjust thresholds weekly based on this data.
Step 4: Implement Dynamic Thresholds
Static thresholds fail. I learned this when a client's business volume increased 300% over six months. All our carefully tuned thresholds became useless.
Now I implement dynamic thresholds that adjust based on the factors below (see the sketch after this list):
Time of day (business hours vs after hours)
Day of week (weekday vs weekend)
Season (retail during holidays, universities during enrollment)
Known events (system maintenance, business travel, conferences)
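In practice, "dynamic" just means the threshold is looked up from context instead of hard-coded. A minimal sketch; the bucket values are placeholders you'd derive from your own baselines, and the maintenance-window flag stands in for a real change calendar.

```python
from datetime import datetime

# Hypothetical failed-authentication thresholds, derived from per-context baselines.
THRESHOLDS = {
    ("weekday", "business_hours"): 15,
    ("weekday", "after_hours"): 8,
    ("weekend", "any"): 5,
}

def threshold_for(now: datetime, maintenance_window: bool = False) -> int:
    """Select a failed-authentication threshold based on day type and time of day."""
    if maintenance_window:
        return 40  # known noisy periods get extra headroom
    if now.weekday() >= 5:  # Saturday or Sunday
        return THRESHOLDS[("weekend", "any")]
    period = "business_hours" if 8 <= now.hour < 18 else "after_hours"
    return THRESHOLDS[("weekday", period)]
```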
Real-World Threshold Example
Let me show you a threshold tuning case study from 2022:
Initial Situation:
Failed authentication threshold: 10 attempts
Alerts per day: 340
True positives: 2-3 per month
False positive rate: 99.7%
After Analysis:
Normal user failed attempts: 0-3 per day (98% of users)
Users with persistent issues: 4-8 per day (1.8% of users)
Actual attacks: 15+ attempts within 5 minutes
New Threshold:
15 failed attempts within a 5-minute window
OR 25 failed attempts in 24 hours
AND not from known problematic accounts
Results:
Alerts per day: 12
True positives: 8-10 per month
False positive rate: 3%
We went from investigating 340 mostly useless alerts per day to 12 highly accurate ones. The security team could actually investigate every alert thoroughly.
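Expressed as code, the tuned rule from this case study looks roughly like the sketch below. The numbers are the ones above; the event shape and the known-problem-accounts exclusion list are assumptions about how you'd feed it data.

```python
from datetime import timedelta

def should_alert(failure_times, account, known_problem_accounts):
    """failure_times: sorted datetimes of failed logins for one account over the last 24 hours."""
    if account in known_problem_accounts:
        return False  # accounts with documented, benign login issues are suppressed
    if len(failure_times) >= 25:  # 25 failed attempts in 24 hours
        return True
    for i, start in enumerate(failure_times):
        in_window = [t for t in failure_times[i:] if t - start <= timedelta(minutes=5)]
        if len(in_window) >= 15:  # 15 failed attempts within any 5-minute window
            return True
    return False
```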
Security Continuous Monitoring (DE.CM): The Always-On Security Guard
Most organizations think of monitoring as "collect logs and search them when something goes wrong." That's not monitoring—that's forensics with extra steps.
Real continuous monitoring is active, real-time observation with immediate alerting.
The DE.CM Categories That Provide Real Visibility
Sub-Category | Monitoring Focus | Why It Matters | Key Technologies |
|---|---|---|---|
DE.CM-1 | Network monitoring | First line of defense; catches lateral movement | Network TAPs, NetFlow, SPAN ports |
DE.CM-2 | Physical environment monitoring | Physical access often precedes logical breach | Cameras, badge readers, environmental sensors |
DE.CM-3 | Personnel activity monitoring | Insider threats and compromised accounts | User activity monitoring, DLP, CASB |
DE.CM-4 | Malicious code detection | Known threats identification | Antivirus, EDR, sandbox analysis |
DE.CM-5 | Unauthorized devices/software | Shadow IT and supply chain attacks | Network access control, asset inventory |
DE.CM-6 | External service provider monitoring | Third-party compromise detection | Vendor security assessments, monitoring |
DE.CM-7 | Unauthorized personnel, connections, devices | Perimeter breach detection | Network admission control, IDS/IPS |
DE.CM-8 | Vulnerability scans | Proactive weakness identification | Vulnerability scanners, patch management |
DE.CM-1: Network Monitoring That Actually Works
In 2021, I implemented network monitoring for a manufacturing company. They had some basic firewalls and called it good.
During the first week of proper network monitoring, we discovered:
A cryptocurrency miner running on 40% of their factory floor computers
An engineering workstation sending data to an IP address in Belarus
An unauthorized VPN server on their network
Three unpatched Windows 2003 servers still running (in 2021!)
None of these showed up in their previous "monitoring" because they weren't actually watching network traffic—they were just logging firewall permits and denies.
Effective Network Monitoring Strategy
Here's what actually works:
Layer 1: NetFlow Analysis
Monitor traffic patterns, not packet contents
Identify communication anomalies
Detect data exfiltration by volume
Low overhead, high visibility
Layer 2: Full Packet Capture (Strategic)
Critical network segments only (database DMZ, executive network)
Deep inspection for threats
Forensic evidence collection
High storage requirements
Layer 3: IDS/IPS
Signature-based threat detection
Known attack pattern identification
Automatic blocking (IPS) of confirmed threats
Regular signature updates critical
Example Network Monitoring Detection:
ALERT: Unusual DNS Query Pattern
- Workstation: EXEC-LAPTOP-042
- Queries: 847 unique DNS requests in 10 minutes
- Pattern: Random subdomain queries to same domain
- Assessment: DNS tunneling for command and control
- Action: Immediate network isolation

DE.CM-3: Personnel Activity Monitoring (The Insider Threat Detector)
Here's something that keeps CISOs up at night: 62% of data breaches involve insider threats or stolen credentials (Verizon DBIR 2024).
I witnessed this firsthand in 2020. A healthcare organization noticed unusual activity from a nurse's account—accessing patient records she had no clinical reason to view. We investigated.
Turned out she was selling celebrity patient information to tabloids. She'd been doing it for 14 months before monitoring caught her. The HIPAA fines alone exceeded $1.2 million.
The sad part? Simple monitoring would have caught her in week one. She was accessing 50-60 patient records per shift with no corresponding care activities.
User Activity Monitoring That Respects Privacy AND Catches Threats
This is delicate territory. Monitor too much, and you create a dystopian workplace. Monitor too little, and you miss insider threats.
Here's my balanced approach:
Monitor This | Don't Monitor This | Why the Distinction Matters |
|---|---|---|
✅ Access to sensitive data | ❌ Personal email content | Privacy vs security balance |
✅ Administrative actions | ❌ Websites visited (unless malicious) | Job function vs personal activity |
✅ After-hours activity | ❌ Keystroke logging | Red flags vs invasive surveillance |
✅ Large data transfers | ❌ Personal file contents | Risk-based vs intrusive |
✅ Privilege escalation | ❌ Personal conversations | Security events vs privacy violation |
✅ Policy violations | ❌ Break time activities | Relevant vs irrelevant |
Focus on WHAT users access, not WHY they're accessing it (until an alert triggers investigation).
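For the patient-records scenario above, the monitoring logic can stay this simple: count sensitive records touched per user per shift and compare against role peers. A hedged sketch with made-up field names and a made-up deviation multiplier.

```python
from collections import defaultdict
from statistics import mean, pstdev

def flag_excessive_access(access_log, role_of, multiplier=3.0):
    """access_log: (user, record_id) tuples for one shift; role_of: user -> role.
    Flags users whose distinct-record count is far above their role's average."""
    per_user = defaultdict(set)
    for user, record_id in access_log:
        per_user[user].add(record_id)
    counts_by_role = defaultdict(list)
    for user, records in per_user.items():
        counts_by_role[role_of[user]].append(len(records))
    flagged = []
    for user, records in per_user.items():
        peers = counts_by_role[role_of[user]]
        spread = pstdev(peers) if len(peers) > 1 else 0.0
        if len(records) > mean(peers) + multiplier * max(spread, 1.0):
            flagged.append((user, len(records)))
    return flagged
```

A nurse pulling 50-60 records per shift while her peers pull a handful trips this rule on day one, not in month fourteen.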
DE.CM-4: Malicious Code Detection (Beyond Basic Antivirus)
"We have antivirus" is the security equivalent of "we have Band-Aids" in medicine. Great! But what about surgery?
Traditional antivirus catches maybe 40-50% of modern malware. I've seen ransomware waltz right past fully updated antivirus solutions because it was too new, too customized, or too clever.
Modern malicious code detection requires multiple layers:
Detection Method | What It Catches | What It Misses | Best Use Case |
|---|---|---|---|
Signature-Based AV | Known malware variants | Zero-day threats, polymorphic malware | Commodity malware, known threats |
Behavioral Analysis | Unknown malware acting suspiciously | Sophisticated attacks mimicking normal behavior | Ransomware, new attack techniques |
Sandboxing | Malware that needs to execute to reveal itself | Time-delayed malware, environment-aware attacks | Email attachments, downloaded files |
Machine Learning | Patterns indicating malicious intent | Completely novel attack methods | Large-scale threat hunting |
Memory Analysis | Fileless malware, in-memory exploits | Persistent threats in files | Advanced persistent threats |
Real Detection Example: Layered Defense in Action
Let me share a perfect example of why you need multiple detection layers.
In 2022, a financial services client got hit with a targeted spear phishing attack. The malware was custom-built for them. Here's how our layered detection responded:
Layer 1 - Email Gateway: ❌ MISSED
Malicious attachment had valid signature
Sender email looked legitimate
No known threat signatures
Layer 2 - Endpoint AV: ❌ MISSED
Zero-day malware, no signature
File appeared benign
Layer 3 - Sandbox Analysis: ⚠️ SUSPICIOUS
File exhibited some unusual behavior
Not definitive enough to block
Flagged for monitoring
Layer 4 - EDR (Endpoint Detection & Response): ✅ DETECTED
Process attempted to disable logging
Created persistence mechanism
Attempted network beacon to unknown domain
ALERT TRIGGERED
Response Time: 4 minutes from execution to containment
Single layer? Compromised. Multiple layers? Contained.
"Modern malware is like a burglar checking for different locks on your door. If you only have one lock, and they have that key, you're toast. Multiple detection layers mean multiple chances to catch them."
DE.CM-8: Vulnerability Scanning (Finding Problems Before Attackers Do)
Here's a harsh reality: the average organization has 57 critical vulnerabilities at any given time (Qualys Research 2024).
Want to know what's worse? Most organizations discover these vulnerabilities AFTER attackers exploit them.
I worked with a company in 2019 that learned this lesson expensively. They'd been breached through EternalBlue, the exploit behind WannaCry. In 2019. Two years after the patch was released.
"We didn't know we had vulnerable systems," the IT manager said.
"Did you scan for them?" I asked.
Silence.
They paid $890,000 in ransomware, response costs, and recovery. A vulnerability scanner costs about $10,000 annually.
That's an 8,900% markup for ignorance.
Vulnerability Scanning Strategy That Works
Here's my standard implementation:
Weekly: Authenticated Scans
Full network scan with credentials
Identifies missing patches
Discovers misconfigurations
Maps software inventory
Monthly: Unauthenticated Scans
External perspective (what attackers see)
Validates patch effectiveness
Identifies perimeter weaknesses
Tests external defenses
Quarterly: Comprehensive Assessments
Web application scanning
Database vulnerability assessment
IoT and operational technology scanning
Cloud infrastructure review
Continuous: Passive Monitoring
Network traffic analysis
Asset discovery
Change detection
Drift identification
Scan Type | Frequency | Focus | Typical Findings |
|---|---|---|---|
Internal Authenticated | Weekly | Missing patches, misconfigurations | 200-500 findings in typical network |
External Unauthenticated | Monthly | Internet-facing vulnerabilities | 20-50 critical findings |
Web Application | Monthly | OWASP Top 10, injection flaws | 30-100 findings per application |
Database | Quarterly | Default passwords, excessive permissions | 40-80 findings per database |
Cloud Configuration | Weekly | Misconfigured services, exposed data | 10-30 findings in typical cloud environment |
Detection Processes (DE.DP): Making Detection Sustainable
Having great detection technology is like owning a Ferrari—useless if nobody knows how to drive it.
I've seen organizations spend $500,000 on detection tools and $50,000 on the people and processes to use them. Six months later, the tools are shelfware and they're back to reactive firefighting.
The Five DE.DP Sub-Categories That Make or Break Programs
Sub-Category | Focus | Common Failure | Success Factor |
|---|---|---|---|
DE.DP-1 | Detection roles and responsibilities | Nobody owns detection | Clear ownership with authority |
DE.DP-2 | Detection activities comply with requirements | Checkbox compliance | Understanding WHY requirements exist |
DE.DP-3 | Detection process testing | Set it and forget it | Regular testing and adjustment |
DE.DP-4 | Event detection communication | Alerts die in queues | Clear escalation paths |
DE.DP-5 | Detection process improvement | Same mistakes repeated | Systematic learning from incidents |
DE.DP-1: Detection Roles (Who's Actually Watching?)
In 2020, I conducted a tabletop exercise for a retail company. I simulated a ransomware attack and asked: "Who's responsible for detecting this?"
Five different people thought they were. None of them actually were.
The IT manager thought the security team handled it. The security team thought the SOC handled it. The SOC thought the MSSP handled it. The MSSP thought they were only responsible for network monitoring. And the CISO thought everyone was handling their part.
This is shockingly common.
The Detection RACI Matrix That Actually Works
I implement a RACI model (Responsible, Accountable, Consulted, Informed) for every detection activity:
Example: Ransomware Detection
Activity | Responsible | Accountable | Consulted | Informed |
|---|---|---|---|---|
Monitor for indicators | SOC Analyst | SOC Manager | Threat Intel Team | CISO |
Investigate alerts | L2 Analyst | SOC Manager | IT Operations | Security Leadership |
Escalate incidents | SOC Manager | CISO | Legal, PR | Executive Team |
Coordinate response | Incident Manager | CISO | All stakeholders | Board |
Post-incident review | Security Team | CISO | All participants | Everyone |
Notice how EVERY activity has ONE accountable person. That's critical. Shared accountability is no accountability.
DE.DP-3: Testing Detection (Trust But Verify)
Here's an uncomfortable question I ask every client: "When's the last time you tested whether your detection actually works?"
The most common answer? "Uhh..."
In 2021, I worked with a healthcare organization that had invested heavily in EDR (Endpoint Detection and Response). They were confident in their detection capabilities. I asked if I could test them.
We simulated a ransomware attack in a controlled test environment. Their EDR missed it completely. The ransomware encrypted 2,000 test files before anyone noticed.
The CISO was devastated. "We spent $300,000 on this solution!"
The problem wasn't the technology—it was the configuration and tuning. Nobody had actually tested it against realistic attack scenarios.
My Detection Testing Framework
Monthly: Synthetic Attacks
Simulate common attack techniques
Test detection and alerting
Measure response time
Validate escalation procedures
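To make that monthly synthetic testing measurable, here's a minimal harness sketch. `inject_test_event` and `alert_fired` are placeholders for whatever your SIEM or EDR actually exposes; the point is recording detection latency, not the specific API.

```python
import time
import uuid

def run_detection_test(inject_test_event, alert_fired, timeout_seconds=900):
    """Inject a uniquely tagged, benign test event, then poll until an alert referencing
    that tag appears or the timeout expires. Returns detection latency in seconds, or None."""
    marker = f"detection-test-{uuid.uuid4()}"
    started = time.monotonic()
    inject_test_event(marker)      # e.g., a harmless tagged log entry or test file drop
    while time.monotonic() - started < timeout_seconds:
        if alert_fired(marker):    # did the pipeline raise an alert containing our marker?
            return time.monotonic() - started
        time.sleep(10)
    return None                    # a missed detection is a finding in itself
```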
Quarterly: Red Team Exercises
Professional attackers test your defenses
Realistic attack scenarios
Identifies gaps in detection coverage
Tests entire response chain
Annual: Purple Team Exercises
Red team attacks, blue team defends, both collaborate
Improves both detection and response
Shares knowledge across teams
Builds organizational capability
Continuous: Alert Validation
Every alert should be reviewed
Track true vs false positives
Identify gaps in detection
Tune rules based on feedback
DE.DP-4: Event Detection Communication (Getting the Right Information to the Right People)
Communication failures kill incident response.
I watched a breach unfold in 2019 where the SOC detected the attack at 10:47 PM. They created a ticket in the system and went home at 11 PM (end of shift).
The ticket sat in a queue until 8:30 AM the next morning.
By then, the attackers had encrypted 40% of the company's file servers.
The SOC did their job—they detected and documented. But nobody told anyone who could actually DO anything about it.
Communication Protocols That Work
Here's my standard communication matrix:
Severity | Initial Notification | Time Frame | Method | Escalation |
|---|---|---|---|---|
CRITICAL | SOC → Security Manager → CISO | Immediate | Phone call + SMS | Auto-escalate in 15 min if no response |
HIGH | SOC → Security Manager | < 30 minutes | Phone call | Escalate to CISO in 1 hour |
MEDIUM | SOC → Security Team | < 2 hours | Ticket + Email | Escalate if no acknowledgment in 4 hours |
LOW | Ticket system | < 8 hours | Ticket | Standard queue |
INFORMATIONAL | Daily digest | Next business day | Email report | None |
Critical rule: If it's important enough to alert on, it's important enough to ensure someone sees it immediately.
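The auto-escalation rule in the table is easy to encode. A sketch under obvious assumptions: `send_page` and `acknowledged` stand in for your paging or ticketing integration, and the contact chain is illustrative.

```python
import time

ESCALATION_CHAIN = {
    "CRITICAL": ["security_manager", "ciso", "executive_on_call"],
    "HIGH": ["security_manager", "ciso"],
}

def notify_with_escalation(severity, send_page, acknowledged, wait_seconds=900):
    """Page each contact in order; stop as soon as someone acknowledges within the wait window."""
    for contact in ESCALATION_CHAIN.get(severity, []):
        send_page(contact)
        deadline = time.monotonic() + wait_seconds
        while time.monotonic() < deadline:
            if acknowledged():
                return contact
            time.sleep(30)
    return None  # nobody acknowledged: trigger the out-of-band procedure
```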
DE.DP-5: Continuous Improvement (Learning From Every Detection)
Every detection—whether true positive or false positive—is a learning opportunity.
I implemented a post-detection review process for a financial services company in 2020. After every alert investigation (not just incidents), analysts documented:
What triggered the alert?
Was it a true or false positive?
How long did investigation take?
What could improve detection?
What could improve response?
Six months of this data revealed something fascinating:
Finding | Impact | Action Taken |
|---|---|---|
40% of alerts were duplicate notifications from multiple sources | Wasted 160 analyst hours/month | Consolidated alerting, saved $38k/month |
3 types of false positives accounted for 60% of false alerts | Analyst burnout, missed real threats | Tuned 3 rules, FP rate dropped 60% |
80% of critical alerts occurred during shift changes | Delayed response by 15-45 minutes | Implemented shift overlap, response time improved 72% |
Analysts spent 30% of time gathering context | Slow investigations | Automated context enrichment, investigation time cut 35% |
The ROI on this improvement process? We calculated over $450,000 in annual savings from efficiency gains alone. The improved threat detection? Priceless.
Building Your Detection Program: A Practical Roadmap
Alright, enough theory. Let me give you the exact roadmap I use to build detection programs:
Phase 1: Foundation (Months 1-3)
Week 1-2: Asset Inventory
What systems do you have?
What data do they contain?
What's their criticality?
Week 3-4: Quick Wins
Deploy basic endpoint protection
Enable logging on critical systems
Implement failed authentication monitoring
Set up basic network monitoring
Weeks 5-8: Initial Baselines
Collect 30 days of normal activity data
Establish preliminary thresholds
Document known anomalies
Train team on new tools
Weeks 9-12: Detection Use Cases
Implement top 10 critical detections
Configure initial alerting
Establish on-call procedures
Begin incident response documentation
Phase 2: Enhancement (Months 4-6)
Month 4: Correlation and Context
Implement SIEM or log correlation
Build correlation rules
Add context enrichment
Tune initial detection rules
Month 5: Advanced Detection
Add behavioral analytics
Implement user activity monitoring
Deploy additional sensors
Expand detection coverage
Month 6: Process Refinement
Document all detection procedures
Conduct first purple team exercise
Review and optimize alert workflows
Implement continuous improvement process
Phase 3: Maturity (Months 7-12)
Month 7-8: Automation
Automate routine investigations
Implement automated response for known threats
Build detection playbooks
Create automated reporting
Month 9-10: Testing and Validation
Regular red team exercises
Monthly detection testing
Quarterly comprehensive assessments
Annual program review
Month 11-12: Optimization
Advanced threat hunting
Machine learning integration
Third-party integration
Continuous tuning and improvement
The Metrics That Actually Matter
Let me share the dashboard I use to track detection program effectiveness (the arithmetic behind the key rows is sketched after the table):
Metric | Target | Why It Matters | How to Measure |
|---|---|---|---|
Mean Time to Detect (MTTD) | < 24 hours | Industry average is 207 days | Time from compromise to detection |
Mean Time to Investigate (MTTI) | < 2 hours | Speed of investigation matters | Time from alert to initial assessment |
Mean Time to Contain (MTTC) | < 4 hours | Limit attacker dwell time | Time from detection to containment |
False Positive Rate | < 5% | Analyst efficiency and effectiveness | FP alerts / total alerts |
Detection Coverage | > 90% | How much of environment is monitored | Monitored assets / total assets |
Alert Tuning Efficiency | < 2% recurring FPs | Quality of detection rules | Repeated FP patterns |
Critical System Visibility | 100% | No blind spots in critical areas | Critical systems monitored |
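Most of those rows reduce to simple arithmetic over your incident and alert records. A hedged sketch, assuming each incident record carries compromise, detection, and containment timestamps.

```python
from statistics import mean

def detection_metrics(incidents, alerts):
    """incidents: dicts with 'compromised_at', 'detected_at', 'contained_at' datetimes.
    alerts: dicts with a boolean 'false_positive' field."""
    def hours(a, b):
        return (b - a).total_seconds() / 3600

    return {
        "mttd_hours": mean(hours(i["compromised_at"], i["detected_at"]) for i in incidents),
        "mttc_hours": mean(hours(i["detected_at"], i["contained_at"]) for i in incidents),
        "false_positive_rate": sum(a["false_positive"] for a in alerts) / len(alerts),
    }
```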
Common Detection Mistakes (And How to Avoid Them)
After 15 years, I've seen these mistakes over and over:
Mistake #1: Collecting Without Analyzing
The Problem: Organizations collect every log from every system and never look at them.
The Fix: Start small. Monitor what you can actually analyze. Add sources as you build capability.
Mistake #2: Alerting Without Response
The Problem: Alerts trigger but nobody responds or they overwhelm the team.
The Fix: Every alert needs an owner and a process. No exceptions.
Mistake #3: Static Thresholds
The Problem: Set thresholds once and never adjust them as business changes.
The Fix: Review thresholds quarterly. Implement dynamic thresholds where possible.
Mistake #4: Tool-First Approach
The Problem: Buy expensive tools without understanding what you need to detect.
The Fix: Define detection requirements first. Then select tools that meet those requirements.
Mistake #5: No Testing
The Problem: Assume detection works without validation.
The Fix: Test regularly. Red team quarterly. Validate after every configuration change.
Your Next Steps
If you're building or improving a detection program, here's what I recommend:
This Week:
Inventory your current detection capabilities
Identify your three biggest blind spots
Document who's responsible for detection activities
Review your most recent security alerts
This Month:
Establish baselines for critical systems
Implement your first correlation rule
Test one detection use case
Document your detection procedures
This Quarter:
Deploy comprehensive monitoring on critical assets
Build out your top 10 detection use cases
Conduct first detection testing exercise
Implement a continuous improvement process
This Year:
Achieve 90% detection coverage
Reduce MTTD to under 24 hours
Build automated response for common threats
Establish mature detection operations
The Bottom Line: Detection Is Not Optional
Here's what fifteen years in cybersecurity has taught me: you're going to get attacked. It's not if, it's when.
The question isn't whether threats will target you. The question is whether you'll know about it when they do.
I've seen organizations survive devastating attacks because they had solid detection. I've watched others crumble under breaches that went undetected for months.
The difference? The NIST Detect function, properly implemented.
Don't be the organization that discovers a breach from the FBI. Don't be the company that reads about their own breach in the news. Don't be the CISO trying to explain to the board how attackers were in your network for 11 months without anyone noticing.
Build detection. Test detection. Trust but verify detection.
Because in cybersecurity, what you don't know absolutely can hurt you.
And what you detect early, you can stop before it becomes catastrophic.
"The best security programs don't prevent every attack. They detect every attack that matters and respond before it becomes a crisis."