Senior Enterprise Platform Reliability Engineer

Essex Weld Solutions
Essex, ON

Job details

Permanent | Full-time
$80,000 to $90,000 per year
Posted 6 days ago

Skills sought

Performance tuning
DevOps
ERP systems
Disaster recovery
SQL
Firewalls
Incident response
PostgreSQL
IDS
Virtualization
ISO 27001
Linux
High availability
VPN

Full job description

Senior Enterprise Platform Reliability Engineer

Infrastructure, ERP, Database & Security Operations

Position Summary

We are seeking an experienced Senior Enterprise Platform Reliability Engineer to lead the reliability, scalability, security, and operational continuity of enterprise infrastructure and ERP environments.

This position combines responsibilities across:

Enterprise Infrastructure Architecture
Site Reliability Engineering (SRE)
Database Reliability Engineering (DBRE)
ERP Platform Engineering
Security Engineering & Compliance
Operational Leadership

You will be responsible for maintaining and evolving mission-critical production systems that support multi-region ERP operations, PostgreSQL database environments, Linux infrastructure, virtualization platforms, security monitoring, and enterprise observability.

The ideal candidate is capable of operating independently within complex production environments, troubleshooting high-impact incidents, improving operational maturity, and driving long-term infrastructure reliability.

This is not a traditional DevOps role.

Key Responsibilities

Enterprise Infrastructure Architecture

Design, maintain, and improve enterprise-grade infrastructure environments, including:

Multi-region Linux infrastructure environments
High-availability PostgreSQL clusters using Patroni, etcd, and Keepalived
Inter-datacenter networking and secure VPN architecture using WireGuard
Proxmox virtualization infrastructure and workload management
OPNsense firewall, routing, reverse proxy, and edge security architecture
Enterprise storage, backup, and disaster recovery systems following 3-2-1 backup strategies
Infrastructure redundancy and failover planning
Production workload scaling and operational continuity planning

ERP Platform Engineering

Manage and support enterprise Odoo ERP environments, including:

Odoo v9, v16, and v18 production environments
Multi-region ERP deployments and infrastructure coordination
Custom module integration support
Worker tuning, memory analysis, and platform scaling
High-availability ERP failover environments
Production recovery and restoration workflows
Neutralized production restores for development environments
Release troubleshooting and production issue resolution
ERP operational performance optimization

Database Reliability Engineering (DBRE)

Maintain the stability, performance, recoverability, and availability of PostgreSQL environments supporting mission-critical business systems.

Responsibilities include:

PostgreSQL performance tuning and workload optimization
SQL execution plan analysis and query troubleshooting
Locking, contention, and replication analysis
Autovacuum and database maintenance strategy management
Cache-hit ratio and buffer performance analysis
High-availability PostgreSQL architecture and failover management
Backup validation and recovery testing
Disaster recovery validation and restoration procedures
Monitoring long-running queries and operational bottlenecks
Database observability using Prometheus, Grafana, exporters, and log aggregation systems
Database operational support during releases, migrations, and upgrades

Site Reliability Engineering (SRE)

Ensure the reliability, availability, stability, and operational continuity of enterprise production systems.

Responsibilities include:

Maintaining high system uptime across multi-region environments
Designing and managing enterprise observability platforms
Centralized monitoring, metrics collection, logging, and alerting
Proactive alerting strategy development
Production incident troubleshooting and operational response
Root-cause analysis and operational recovery coordination
High-availability infrastructure design and failover validation
Backup validation and disaster recovery readiness testing
Infrastructure and platform health monitoring
Operational documentation and reliability process development
Cross-functional collaboration between infrastructure, development, database, and operational teams

Security Engineering & Compliance

Design and maintain enterprise security architecture and operational controls across infrastructure, databases, networking, and ERP systems.

Responsibilities include:

SIEM architecture and centralized security monitoring
Wazuh and Security Onion deployment and management
IDS/IPS implementation and network security monitoring
Firewall segmentation and network security policy design
Secure VPN architecture and encrypted inter-site connectivity
Security observability and enterprise logging
Vulnerability identification and operational risk analysis
Security incident investigation and forensic support
Business Continuity Planning (BCP) and Disaster Recovery (DR) strategy development
ISO 27001-aligned operational security practices
PIPEDA-aware operational controls and data protection processes
Security documentation, audit readiness, and compliance support

Operational Leadership

Provide operational leadership and technical governance across infrastructure and production operations.

Responsibilities include:

Developing and maintaining Standard Operating Procedures (SOPs)
Defining operational standards and governance processes
Coordinating production incident escalation and response
Supporting deployment governance and change management practices
Evaluating operational risks associated with infrastructure and software changes
Supporting development teams with deployment and infrastructure troubleshooting
Creating operational workflows and recovery procedures
Improving infrastructure maturity, standardization, and reliability practices
Supporting management with infrastructure planning and operational readiness initiatives
Driving long-term infrastructure sustainability and operational resilience

Required Qualifications

7+ years of experience managing enterprise Linux infrastructure
Advanced PostgreSQL administration and performance tuning experience
Strong understanding of high-availability architecture and failover systems
Experience managing enterprise virtualization platforms such as Proxmox
Experience with observability platforms including Grafana, Prometheus, Loki, and exporters
Strong networking knowledge including VPNs, routing, firewalls, and reverse proxies
Experience supporting production ERP environments
Strong incident response and troubleshooting abilities
Experience designing backup and disaster recovery strategies
Strong understanding of operational security and infrastructure hardening
Ability to independently manage production-critical systems
Strong documentation and operational process development skills

Preferred Qualifications

Experience supporting Odoo ERP environments
Experience with Patroni, etcd, and PostgreSQL HA clustering
Experience with Wazuh, Security Onion, or SIEM platforms
Familiarity with ISO 27001 operational practices
Experience managing multi-region infrastructure deployments
Experience working within hybrid cloud and on-premises environments
Experience leading operational improvement initiatives

What Success Looks Like

Successful candidates will:

Operate comfortably within complex enterprise production environments
Take ownership of infrastructure reliability and operational continuity
Improve platform stability and observability over time
Reduce operational risk through automation, documentation, and standardization
Troubleshoot high-impact production issues efficiently and methodically
Balance performance, reliability, scalability, and security considerations
Communicate clearly with technical and non-technical stakeholders

Environment & Technology Stack

Infrastructure

Ubuntu Linux
Proxmox
OPNsense
WireGuard
Enterprise storage and backup systems

Databases

PostgreSQL
Patroni
etcd
HA clustering and replication environments

ERP / Application Platforms

Odoo ERP
Custom module environments
Multi-region application deployments

Monitoring & Observability

Grafana
Prometheus
Loki
Exporters
SIEM platforms

Security

Wazuh
Security Onion
IDS/IPS systems
Reverse proxy architecture

Compensation & Benefits

Compensation will be competitive and aligned with experience, technical capability, and operational leadership ability.

Additional benefits may include:

Extended health and dental coverage
Paid vacation
Professional development support
Flexible work arrangements
Access to enterprise-grade infrastructure environments
Long-term career growth opportunities

Pay: $80,000.00-$90,000.00 per year

Benefits:

Company events
Dental care
Employee assistance program
Extended health care
Flexible schedule
Life insurance
On-site parking

Work Location: In person
