Marsh is seeking a visionary, transformational leader to reimagine and rebuild our Observability and Site Reliability Engineering function from the ground up. This is not a role for someone who wants to maintain the status quo. We need a leader who will fundamentally shift this function to a predictive, data-driven engineering discipline that prevents outages before they happen, embeds reliability into every system from design through production, and treats observability data as a strategic asset - not just an operational tool.
This is a career-defining opportunity to build a world-class observability and SRE organization at Fortune 500 scale.
Job Responsibilities
STRATEGIC VISION & PLATFORM TRANSFORMATION
Define and execute an observability and SRE strategy that shifts the organization from reactive operations to predictive reliability engineering.
Architect and deliver a unified, full-stack observability platform covering metrics, traces, logs, real-user monitoring (RUM), synthetic monitoring, and business-level KPIs - across on-prem, multi-cloud (AWS/Azure), containers, and SaaS integrations.
Rationalize and consolidate the current fragmented tooling landscape into a cohesive, cost-optimized platform. Eliminate redundant tools, reduce alert noise by 80%+, and establish a single pane of glass for system health.
Drive adoption of OpenTelemetry as the standard instrumentation framework, ensuring vendor-agnostic telemetry collection and future portability.
PREDICTIVE & PROACTIVE RELIABILITY
Build and operationalize AIOps and ML-driven capabilities to detect anomalies, predict failures, and surface emerging risks before they impact customers. Move beyond threshold-based alerting to intelligent, context-aware detection.
Establish automated correlation engines that link infrastructure signals, application traces, deployment events, and change records to dramatically reduce diagnostic time and identify root cause automatically.
Design and implement self-healing automation that detects, diagnoses, and remediates common failure patterns without human intervention - targeting 40%+ of recurring incidents for autonomous resolution.
Introduce chaos engineering and reliability testing programs (GameDays, fault injection, load testing) to proactively discover weaknesses before production incidents reveal them.
SITE RELIABILITY ENGINEERING CULTURE
Transform the existing operations-centric team into a modern SRE organization with embedded reliability engineers across product and platform squads, operating under a "you build it, you own it" model.
Define and implement SLO/SLI/Error Budget frameworks across critical services, creating a shared language between engineering, product, and business stakeholders for reliability decisions.
Drive adoption of DevOps practices, CI/CD pipelines, and infrastructure as code, using tools such as Terraform or CloudFormation.
Champion reliability-first design principles - ensuring observability, graceful degradation, circuit breaking, and failure isolation are architected into every system from day one, not bolted on after launch.
INCIDENT PREVENTION & RAPID RECOVERY
Partner with Major Incident Management and Problem Management to build closed-loop feedback systems - every incident produces a reliability improvement, not just a postmortem document.
Drive MTTR toward minutes (not hours) through automated diagnostics, pre-built remediation playbooks, and intelligent correlation that tells responders what is wrong, not just that something is wrong.
Establish "Incidents Prevented" as a primary success metric alongside traditional MTTR/MTTD measures.
BUSINESS-ALIGNED OBSERVABILITY
Elevate observability from infrastructure metrics to business outcomes. Build real-time dashboards that connect system health to revenue impact, customer experience scores, and SLA compliance.
Integrate observability insights into ITSM (ServiceNow), data platforms, and executive reporting - making reliability data a first-class input to business and technology decision-making.
ENGINEERING & OPERATIONAL EXCELLENCE
Own the total cost of ownership of the observability platform. Optimize spend through data tiering, intelligent sampling, retention policies, and vendor negotiations. Deliver more insight per dollar.
Manage strategic vendor relationships (Datadog, Splunk, LogicMonitor, cloud-native tooling) with a focus on maximizing value, not just managing licenses.
Build a platform engineering mindset in which observability capabilities are delivered as self-service products to engineering teams: instrumentation libraries, dashboard templates, alerting-as-code, and SLO toolkits.
TEAM BUILDING & LEADERSHIP
Recruit, develop, and retain a world-class team of site reliability engineers, observability platform engineers, data and performance engineers, and reliability analysts.
Establish an Observability & SRE Center of Excellence that drives standards, best practices, and enablement across the global enterprise.
Foster a learning culture through internal tech talks, blameless postmortems, chaos engineering programs, and industry engagement.
REQUIRED EXPERIENCE & EXPERTISE
15+ years in technology with 8+ years in progressively senior observability, SRE, or platform reliability leadership roles.
Demonstrated track record of transforming reactive monitoring organizations into proactive, engineering-driven SRE functions at enterprise scale (10,000+ employees, 1,000+ applications).
Deep expertise across the full observability stack: metrics (Prometheus, Datadog, CloudWatch), distributed tracing (Jaeger, OpenTelemetry, Datadog APM), log aggregation (Splunk, ELK, Datadog Logs), synthetic monitoring, and RUM.
Hands-on experience defining and operationalizing SLO/SLI/Error Budget frameworks that drive engineering prioritization and business alignment.
Proven experience building AIOps / ML-driven anomaly detection and automated remediation capabilities - not just evaluating vendor demos, but delivering production systems that prevent real incidents.
Strong background in chaos engineering, resilience testing, and reliability-by-design practices (circuit breakers, bulkheads, graceful degradation, retry/backoff patterns).
Experience operating across hybrid infrastructure: on-premises data centers, AWS, Azure, containerized workloads (Kubernetes), and SaaS platforms.
Demonstrated ability to drive cultural and organizational transformation across large, complex enterprises with multiple business units and hundreds of engineering squads.
Experience managing $5M+ observability platform budgets and optimizing total cost of ownership while expanding coverage and capability.
Executive communication skills - ability to present reliability strategy, risk posture, and investment cases to C-suite and board-level audiences.
Visionary thinker who can articulate a compelling future state and build the roadmap to get there - then execute relentlessly.