Matthew Evans
Consulting Engineer, WWT
CISSP · Charlotte, NC
Infrastructure background. Today I spend more than half my day writing code.
Credentials
By the Numbers
20+
Years in Cloud
& Reliability Engineering
280+
Production Services
Migrated to Kubernetes
99.99%
Reliability Achieved
on Re-Architected Platforms
Insurance
Financial Services
Security SaaS
Hedge Fund
Banking
Aviation
2011
Before SRE had a name
Automating toil · Reducing alert noise · Enabling confident decisions
Definition
SRE is a Framework
Business
Strategy
Engineering
Execution
A shared language for what you need and what it costs
The Core Concept
Shared Language
Business articulates
what they need
↔
Engineering responds with
what it costs
Effort · Risk · Investment
01
Blameless Culture
When incidents happen, we focus on what broke and how to prevent it.
Never who broke it.
02
Consistent Decision-Making
Business Strategy
→
LOB Priorities
→
Engineering
Same playbook. Traceable decisions.
03
Right-Sized Investment
Monitor what the business actually needs.
No over-spending on unused tools.
No blind spots that cause surprises.
SRE isn't about saying anyone did anything wrong. It's about giving the talented people you already have a consistent framework so they can communicate more effectively with the business.
It makes everyone's life easier.
Real-World Examples
Three Organizations, Three Angles
Hedge Fund
Process & Visibility
Security SaaS
Incident Mgmt & Alerts
Fintech
Cost Optimization
Organization A · Hedge Fund
The Situation
- 15-year-old deployment pipeline
- Documented in a single Word doc
- Engineers SSH'd into servers to troubleshoot
- Identical servers behaved differently in production
- Zero end-to-end visibility
Organization A · Hedge Fund
The Result
Human deployment errors dropped to near zero
"don't touch it"
→
"deploy with confidence"
Organization B · Security SaaS
5,000
alerts
Military · Government · Energy · Financial · Medical
No incident process. Engineers couldn't self-serve diagnostics.
Organization B · Security SaaS
Every remaining alert: genuinely actionable
Organization C · Fintech
"We're overpaying
for our stage"
They were right.
Organization C · Fintech
$10k
Per Month
→
$1k
Per Month
90% reduction
+ Regional failover they didn't have before
Key Takeaway
SRE is a Framework,
Not a Criticism
Org A
Process & Visibility
Deployment confidence
Org B
Alerts & Incidents
5,000 → 5 alerts
Org C
Cost Optimization
90% spend reduction
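A quick way to sanity-check the headline numbers above is to compute the reductions directly. This is only a minimal sketch: the helper name `percent_reduction` is hypothetical, and the inputs are the figures quoted on the slides (5,000 → 5 alerts, $10k → $1k per month).

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Figures taken from the slides; the variable names are illustrative only.
alert_reduction = percent_reduction(5_000, 5)        # Org B: alert count
spend_reduction = percent_reduction(10_000, 1_000)   # Org C: monthly spend in USD

print(f"Org B alerts: {alert_reduction:.1f}% fewer")
print(f"Org C spend:  {spend_reduction:.0f}% lower")
```

The same helper works for any before/after pair, which makes it easy to express a team's own alert or spend numbers in the same terms.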
Discussion
Let's Talk About Your World
- What does your team spend the most time on that feels repetitive or manual?
- When an incident happens today, what does the first 15 minutes look like?
- If you could have one piece of data you don't have today, what would it be?