Deltek is looking for a Team Lead, Senior Software Engineer to join our Site Reliability Engineering team. In this role, you will be responsible for the reliability, scalability, and performance of our globally used SaaS platforms. You will bridge the gap between software engineering and infrastructure operations, building the tools, automation, and systems that keep our products running for thousands of customers and millions of users.
This is a high-ownership role in a "never-stop-learning" environment. You will work closely with development teams to embed reliability practices early in the software lifecycle, respond to production incidents, and drive continuous improvements to our observability and operational posture.
Key Responsibilities:
Site Reliability & Platform Engineering:
- Design, build, and maintain the infrastructure and tooling that underpins Deltek's SaaS platforms at scale.
- Drive reliability improvements across the full stack, spanning application-level resilience patterns through to infrastructure-level fault tolerance.
- Uphold and extend our IaC-first engineering culture, where all infrastructure changes are made through code and shipped to production via fully automated CI/CD pipelines.
- Build and improve CI/CD pipelines to support safe, frequent deployments with automated rollback capabilities.
- Develop internal tooling and automation to reduce toil and increase engineering self-service.
Observability & Performance:
- Design and maintain comprehensive observability solutions including logging, metrics, tracing, and alerting across our AWS-based infrastructure.
- Proactively identify performance bottlenecks and reliability risks before they impact customers.
- Conduct capacity planning and load testing to ensure systems can scale to meet demand.
Incident Management & On-Call Support:
- Participate in and own the on-call rotation, ensuring fair distribution and adequate coverage across the team, and acting as a first responder for production incidents affecting our SaaS platforms.
- Lead incident response: triage, coordinate cross-team resolution, communicate clearly with stakeholders, and drive issues to resolution with a sense of urgency.
- Own post-incident reviews, facilitate blameless post-mortems, identify root causes, and ensure action items are tracked and completed.
- Take pride in leaving systems better than you found them, consistently reducing the frequency and impact of incidents over time.
Team Leadership:
- Act as the technical lead for the SRE team, setting direction, priorities, and standards for how the team operates.
- Lead and facilitate team ceremonies including standups, retrospectives, and planning sessions.
- Serve as an escalation point for complex or high-severity incidents, providing guidance and support to engineers during critical moments.
- Collaborate with engineering managers and stakeholders to align SRE priorities with broader product and platform goals.
Collaboration & Engineering Culture:
- Partner with software engineering teams to review system designs and architectures with a reliability lens.
- Mentor and provide technical guidance to junior engineers on SRE practices, tooling, and operational excellence.
- Contribute to a strong team culture: supportive, curious, and focused on doing great work while having fun.