SLURM HPC Architect / Administrator

Arcadion - Ottawa, ON

Job details

Full-time

Desired profile

QoS
System administration
Azure
Node.js
NFS
Kubernetes
Ansible
AWS
Docker
Distributed systems
Terraform
Accounting
Ubuntu
Puppet
SSH
Linux
Chef
Artificial intelligence
Python
Shell scripting

Full job description

SLURM HPC Architect / Administrator

Location: Remote (Canada, U.S., or Europe Preferred)
Company: Cylix Applied Intelligence
Employment Type: Full-Time or Contract

About the Role

Cylix Applied Intelligence is seeking an experienced SLURM HPC Architect / Administrator to design, deploy, and operate high-performance computing (HPC) clusters supporting AI training, large-scale inference, scientific computing, and enterprise workloads.

This role will focus on building and managing enterprise-grade HPC environments powered by GPU and CPU compute clusters, leveraging SLURM as the core workload orchestration and resource scheduling platform.

You will work closely with AI engineers, infrastructure teams, and enterprise clients to deliver scalable, reliable, high-performance compute environments across on-premises, hybrid, and cloud platforms.

Key Responsibilities

HPC Cluster Architecture and Design

Design and implement SLURM-based HPC cluster architectures
Architect scalable CPU and GPU compute environments
Define cluster topology including compute, storage, login, and management nodes
Design high-availability SLURM controller configurations
Implement cluster segmentation, partitioning, and resource allocation strategies

SLURM Deployment and Administration

Install, configure, and manage SLURM workload manager environments
Configure SLURM partitions (queues), QoS policies, and scheduling policies
Manage job scheduling optimization and fair-share policies
Implement accounting, usage tracking, and reporting systems
Maintain SLURM cluster health, stability, and performance

GPU Cluster and AI Infrastructure Management

Configure GPU scheduling and allocation policies
Support GPU resource management including:
- NVIDIA A100, H100, L40, and similar accelerator platforms
- MIG partitioning and GPU isolation
- Multi-tenant GPU resource allocation
Optimize cluster performance for AI training and inference workloads

Infrastructure Automation and Operations

Automate cluster deployment and configuration using:
- Ansible, Terraform, or similar tools
- Shell scripting and Python
Implement monitoring, alerting, and performance tracking systems
Support cluster lifecycle management, upgrades, and expansion

Storage and Filesystem Integration

Integrate HPC clusters with high-performance storage systems including:
- NFS
- Lustre
- BeeGFS
- GPFS / Spectrum Scale
Optimize I/O performance and storage architecture

User and Workload Support

Support enterprise and research users with job scheduling and optimization
Troubleshoot job failures and performance issues
Assist engineering teams in optimizing workloads for HPC environments

Required Qualifications

3+ years of experience administering HPC clusters
Strong experience with SLURM workload manager
Strong Linux system administration experience (Ubuntu, Rocky Linux, RHEL, or similar)
Experience with HPC cluster architecture and deployment
Experience with shell scripting and automation
Experience with:
- Cluster resource management
- Multi-node distributed computing environments
- SSH, networking, and Linux system internals

Preferred Qualifications

Experience managing GPU-based HPC clusters
Experience supporting AI / ML workloads
Experience with NVIDIA GPU platforms and drivers
Experience with:
- CUDA environments
- NVIDIA MIG configuration
- GPU scheduling optimization
Experience with configuration management and provisioning tools:
- Ansible
- Terraform
- Puppet or Chef
Experience with monitoring tools such as:
- Prometheus
- Grafana
- Node exporter
- SLURM accounting tools

Nice to Have

Experience with large-scale enterprise or cloud HPC environments
Experience deploying HPC environments on cloud platforms such as:
- AWS
- Azure
- Private cloud environments
Experience with containerized HPC workloads:
- Docker
- Singularity / Apptainer
Experience integrating SLURM with Kubernetes or hybrid orchestration systems

Example Projects You Will Work On

Deployment of enterprise AI GPU clusters
Multi-tenant SLURM cluster architecture design
GPU scheduling optimization for AI workloads
HPC infrastructure for large-scale inference and model training
Hybrid HPC environments spanning data center and cloud
HPC cluster performance optimization and scaling

Technology Environment

You will work with:

SLURM Workload Manager
Linux (Ubuntu, Rocky Linux, RHEL)
NVIDIA GPU platforms (A100, H100, L40)
High-performance storage systems
HPC networking (InfiniBand, high-speed Ethernet)
Automation tools (Ansible, Terraform)
Monitoring tools (Prometheus, Grafana)
Container environments (Docker, Apptainer)

What We Offer

Competitive compensation
Remote-first environment
Opportunity to work with cutting-edge HPC and AI infrastructure
Exposure to enterprise-scale AI and compute environments
Flexible employment structure (Full-Time or Contract)
Opportunity to architect next-generation HPC environments

About Cylix Applied Intelligence

Cylix Applied Intelligence builds enterprise AI infrastructure and high-performance computing environments supporting advanced AI workloads, intelligent automation, and enterprise-scale compute platforms.
