SLURM HPC Architect / Administrator
Location: Remote (Canada, U.S., or Europe Preferred)
Company: Cylix Applied Intelligence
Employment Type: Full-Time or Contract
About the Role
Cylix Applied Intelligence is seeking an experienced SLURM HPC Architect / Administrator to design, deploy, and operate high-performance computing (HPC) clusters supporting AI training, large-scale inference, scientific computing, and enterprise workloads.
This role will focus on building and managing enterprise-grade HPC environments powered by GPU and CPU compute clusters, leveraging SLURM as the core workload orchestration and resource scheduling platform.
You will work closely with AI engineers, infrastructure teams, and enterprise clients to deliver scalable, reliable, and high-performance compute environments across on-premises, hybrid, and cloud platforms.
Key Responsibilities
HPC Cluster Architecture and Design
- Design and implement SLURM-based HPC cluster architectures
- Architect scalable CPU and GPU compute environments
- Define cluster topology including compute, storage, login, and management nodes
- Design high-availability SLURM controller configurations
- Implement cluster segmentation, partitioning, and resource allocation strategies
SLURM Deployment and Administration
- Install, configure, and manage SLURM workload manager environments
- Configure SLURM partitions (queues), QoS policies, and scheduling policies
- Optimize job scheduling and manage fair-share policies
- Implement accounting, usage tracking, and reporting systems
- Maintain SLURM cluster health, stability, and performance
GPU Cluster and AI Infrastructure Management
- Configure GPU scheduling and allocation policies
- Support GPU resource management including:
  - NVIDIA A100, H100, L40, and similar accelerator platforms
  - MIG partitioning and GPU isolation
  - Multi-tenant GPU resource allocation
- Optimize cluster performance for AI training and inference workloads
Infrastructure Automation and Operations
- Automate cluster deployment and configuration using:
  - Ansible, Terraform, or similar tools
  - Shell scripting and Python
- Implement monitoring, alerting, and performance tracking systems
- Support cluster lifecycle management, upgrades, and expansion
Storage and Filesystem Integration
- Integrate HPC clusters with high-performance storage systems including:
  - NFS
  - Lustre
  - BeeGFS
  - GPFS / Spectrum Scale
- Optimize I/O performance and storage architecture
User and Workload Support
- Support enterprise and research users with job scheduling and optimization
- Troubleshoot job failures and performance issues
- Assist engineering teams in optimizing workloads for HPC environments
Required Qualifications
- 3+ years of experience administering HPC clusters
- Strong experience with SLURM workload manager
- Strong Linux system administration experience (Ubuntu, Rocky Linux, RHEL, or similar)
- Experience with HPC cluster architecture and deployment
- Experience with shell scripting and automation
- Experience with:
  - Cluster resource management
  - Multi-node distributed computing environments
  - SSH, networking, and Linux system internals
Preferred Qualifications
- Experience managing GPU-based HPC clusters
- Experience supporting AI / ML workloads
- Experience with NVIDIA GPU platforms and drivers
- Experience with:
  - CUDA environments
  - NVIDIA MIG configuration
  - GPU scheduling optimization
- Experience with configuration management and infrastructure-as-code tools:
  - Ansible
  - Terraform
  - Puppet or Chef
- Experience with monitoring tools such as:
  - Prometheus
  - Grafana
  - Node exporter
  - SLURM accounting tools
Nice to Have
- Experience with large-scale enterprise or cloud HPC environments
- Experience deploying HPC environments on cloud platforms such as:
  - AWS
  - Azure
  - Private cloud environments
- Experience with containerized HPC workloads:
  - Docker
  - Singularity / Apptainer
- Experience integrating SLURM with Kubernetes or hybrid orchestration systems
Example Projects You Will Work On
- Deployment of enterprise AI GPU clusters
- Multi-tenant SLURM cluster architecture design
- GPU scheduling optimization for AI workloads
- HPC infrastructure for large-scale inference and model training
- Hybrid HPC environments spanning data center and cloud
- HPC cluster performance optimization and scaling
Technology Environment
You will work with:
- SLURM Workload Manager
- Linux (Ubuntu, Rocky Linux, RHEL)
- NVIDIA GPU platforms (A100, H100, L40)
- High-performance storage systems
- HPC networking (InfiniBand, high-speed Ethernet)
- Automation tools (Ansible, Terraform)
- Monitoring tools (Prometheus, Grafana)
- Container environments (Docker, Apptainer)
What We Offer
- Competitive compensation
- Remote-first environment
- Opportunity to work with cutting-edge HPC and AI infrastructure
- Exposure to enterprise-scale AI and compute environments
- Flexible employment structure (Full-Time or Contract)
- Opportunity to architect next-generation HPC environments
About Cylix Applied Intelligence
Cylix Applied Intelligence builds enterprise AI infrastructure and high-performance computing environments supporting advanced AI workloads, intelligent automation, and enterprise-scale compute platforms.