Senior Observability & SRE Leader

Marsh
Toronto, ON


Job details

Full-time

Desired profile

Azure
Kubernetes
Full-stack development
DevOps
AWS
Terraform
Splunk
ServiceNow
Budget management
Communication skills

Full job description

Company:

Marsh Corporate

Description:

Senior Observability & SRE Leader

Marsh is seeking a visionary, transformational leader to reimagine and rebuild our Observability and Site Reliability Engineering function from the ground up. This is not a role for someone who wants to maintain the status quo. We need a leader who will fundamentally shift this function to a predictive, data-driven engineering discipline that prevents outages before they happen, embeds reliability into every system from design through production, and treats observability data as a strategic asset - not just an operational tool.

This is a career-defining opportunity to build a world-class observability and SRE organization at Fortune 500 scale.

Job Responsibilities

STRATEGIC VISION & PLATFORM TRANSFORMATION

Define and execute an observability and SRE strategy that shifts the organization from reactive operations to predictive reliability engineering.
Architect and deliver a unified, full-stack observability platform covering metrics, traces, logs, real-user monitoring (RUM), synthetic monitoring, and business-level KPIs - across on-prem, multi-cloud (AWS/Azure), containers, and SaaS integrations.
Rationalize and consolidate the current fragmented tooling landscape into a cohesive, cost-optimized platform. Eliminate redundant tools, reduce alert noise by 80%+, and establish a single pane of glass for system health.
Drive adoption of OpenTelemetry as the standard instrumentation framework, ensuring vendor-agnostic telemetry collection and future portability.

PREDICTIVE & PROACTIVE RELIABILITY

Build and operationalize AIOps and ML-driven capabilities to detect anomalies, predict failures, and surface emerging risks before they impact customers. Move beyond threshold-based alerting to intelligent, context-aware detection.
Establish automated correlation engines that link infrastructure signals, application traces, deployment events, and change records to dramatically reduce diagnostic time and identify root cause automatically.
Design and implement self-healing automation that detects, diagnoses, and remediates common failure patterns without human intervention - targeting 40%+ of recurring incidents for autonomous resolution.
Introduce chaos engineering and reliability testing programs (GameDays, fault injection, load testing) to proactively discover weaknesses before production incidents reveal them.

SITE RELIABILITY ENGINEERING CULTURE

Transform the existing operations-centric team into a modern SRE organization with embedded reliability engineers across product and platform squads, operating under a "you build it, you own it" model.
Define and implement SLO/SLI/Error Budget frameworks across critical services, creating a shared language between engineering, product, and business stakeholders for reliability decisions.
Drive the adoption of DevOps practices, CI/CD pipelines, and infrastructure as code using tools like Terraform or CloudFormation to manage infrastructure.
Champion reliability-first design principles - ensuring observability, graceful degradation, circuit breaking, and failure isolation are architected into every system from day one, not bolted on after launch.

INCIDENT PREVENTION & RAPID RECOVERY

Partner with Major Incident Management and Problem Management to build closed-loop feedback systems - every incident produces a reliability improvement, not just a postmortem document.
Drive MTTR toward minutes (not hours) through automated diagnostics, pre-built remediation playbooks, and intelligent correlation that tells responders what is wrong, not just that something is wrong.
Establish "Incidents Prevented" as a primary success metric alongside traditional MTTR/MTTD measures.

BUSINESS-ALIGNED OBSERVABILITY

Elevate observability from infrastructure metrics to business outcomes. Build real-time dashboards that connect system health to revenue impact, customer experience scores, and SLA compliance.
Integrate observability insights into ITSM (ServiceNow), data platforms, and executive reporting - making reliability data a first-class input to business and technology decision-making.

ENGINEERING & OPERATIONAL EXCELLENCE

Own the total cost of ownership of the observability platform. Optimize spend through data tiering, intelligent sampling, retention policies, and vendor negotiations. Deliver more insight per dollar.
Manage strategic vendor relationships (Datadog, Splunk, LogicMonitor, cloud-native tooling) with a focus on maximizing value extraction, not just license management.
Build a platform engineering mindset: observability capabilities are delivered as self-service products to engineering teams - instrumentation libraries, dashboard templates, alerting-as-code, SLO toolkits.

TEAM BUILDING & LEADERSHIP

Recruit, develop, and retain a world-class team of SRE engineers, observability platform engineers, data and performance engineers, and reliability analysts.
Establish an Observability & SRE Centre of Excellence that drives standards, best practices, and enablement across the global enterprise.
Foster a learning culture through internal tech talks, blameless postmortems, chaos engineering programs, and industry engagement.

REQUIRED EXPERIENCE & EXPERTISE

15+ years in technology with 8+ years in progressively senior observability, SRE, or platform reliability leadership roles.
Demonstrated track record of transforming reactive monitoring organizations into proactive, engineering-driven SRE functions at enterprise scale (10,000+ employees, 1,000+ applications).
Deep expertise across the full observability stack: metrics (Prometheus, Datadog, CloudWatch), distributed tracing (Jaeger, OpenTelemetry, Datadog APM), log aggregation (Splunk, ELK, Datadog Logs), synthetic monitoring, and RUM.
Hands-on experience defining and operationalizing SLO/SLI/Error Budget frameworks that drive engineering prioritization and business alignment.
Proven experience building AIOps / ML-driven anomaly detection and automated remediation capabilities - not just evaluating vendor demos, but delivering production systems that prevent real incidents.
Strong background in chaos engineering, resilience testing, and reliability-by-design practices (circuit breakers, bulkheads, graceful degradation, retry/backoff patterns).
Experience operating across hybrid infrastructure: on-premises data centers, AWS, Azure, containerized workloads (Kubernetes), and SaaS platforms.
Demonstrated ability to drive cultural and organizational transformation across large, complex enterprises with multiple business units and hundreds of engineering squads.
Experience managing $5M+ observability platform budgets and optimizing total cost of ownership while expanding coverage and capability.
Executive communication skills - ability to present reliability strategy, risk posture, and investment cases to C-suite and board-level audiences.
Visionary thinker who can articulate a compelling future state and build the roadmap to get there - then execute relentlessly.

Marsh (NYSE: MRSH) is a global leader in risk, reinsurance and capital, people and investments, and management consulting, advising clients in 130 countries. With annual revenue of over $27 billion and more than 95,000 colleagues, Marsh helps build the confidence to thrive through the power of perspective. For more information, visit corporate.marsh.com, or follow us on LinkedIn and X.

Marsh is committed to embracing a diverse, inclusive and flexible work environment. We aim to attract and retain the best people and embrace diversity of age, background, disability, ethnic origin, family duties, gender orientation or expression, marital status, nationality, parental status, personal or social status, political affiliation, race, religion and beliefs, sex/gender, sexual orientation or expression, skin color, or any other characteristic protected by applicable law. In accordance with the Accessibility for Ontarians with Disabilities Act, 2005, Marsh will provide a reasonable accommodation to employees and prospective employees to the point of undue hardship upon request and as required in respect of the individual’s particular restrictions and limitations. If you require a specific accommodation because of a disability or medical need, please contact [email protected].

Marsh is committed to hybrid work, which includes the flexibility of working remotely and the collaboration, connections and professional development benefits of working together in the office. All Marsh colleagues are expected to be in their local office or working onsite with clients at least three days per week. Office-based teams will identify at least one “anchor day” per week on which their full team will be together in person.

This is a new position.
