Reliability is a product, not a tax.
Why operational excellence belongs on the roadmap — with owners and KPIs — not buried in overhead.
Leading cloud infrastructure, reliability, and AI-driven operations across mission-critical Azure platforms.
25 years in technology. 15 years at Microsoft. Leading global teams that keep business-critical systems running quietly.
The best operations teams are invisible.
Customers never notice reliability. They only notice failure. The work is making complexity disappear — quietly, repeatedly, at scale.
A leader is hired to move numbers and reduce risk. These are the ones I move.
Capacity right-sizing, reservation strategy, and elimination of idle infrastructure across an Azure data platform operating 24x7.
Synapse, Cosmos DB, and Fabric supporting global marketing and customer-lifecycle workloads. Operational readiness embedded in every change.
Automation-first runbooks, standardized incident response, and operational reviews rebuilt around measurable outcomes rather than activity.
Infrastructure engineering, data operations, and program management. Calm under load. Low attrition. High ownership.
Incident intelligence, cost forecasting, knowledge automation, and runbook copilots — grounded in telemetry, not demoware.
Twenty-five years tracing the arc of cloud: building, operating, and now reshaping it with AI.
For three years I have been applying LLMs and automation to the operations stack itself. The aim is the same as every other choice in operations: faster signal, cheaper toil, fewer surprises.
Pattern recognition across historical incidents to surface likely cause classes within minutes of a page.
Tribal operational knowledge made searchable, queryable, and grounded in primary telemetry.
In-context assistants for on-call engineers — attached to dashboards, not a separate chat window.
Forecasting, anomaly detection, and exec-grade Azure spend reporting surfaced before invoicing.
LLM-assisted post-incident reviews that compress the time from outage to written, accountable RCA.
Living runbooks that propose the next safe action, parameterized by the current state of the system.
A working list of LLM use cases that have survived production contact in a Microsoft data platform.
Notes on running global on-call teams without burning them out — or losing pager discipline.
Direct to inbox. No forms.