Location: Mackenzie, BC or Prince George, BC, Canada
Work Model: Fully onsite, 5 days/week (candidates must be willing to work onsite in either Prince George or Mackenzie, BC)
Required Skills
Hands-on experience with Dell PowerEdge Rack/Tower servers
NVIDIA certifications
Preferred Skills
Experience with Dell PowerEdge XE servers
Experience with NVIDIA Quantum (QR) switches
Key Technical Requirements
Strong hands-on experience with GPU deployment, configuration, and multi-node testing using NVIDIA Base Command Manager
Expertise with benchmarking and performance tools such as HPL, HPL-MxP, STREAM, NCCL, RCCL, and OSU Micro-Benchmarks
Red Hat certification (RHCSA/RHCE) or 7+ years of experience with Red Hat-based Linux distributions
Experience with GenAI/HPC networking technologies including InfiniBand and/or RoCE
Experience working in large-scale Linux-based parallel computing environments
Strong customer interaction and communication skills
Qualifications
Bachelor’s degree
NVIDIA certifications such as NCA, NCE, or DGX
Experience with NVIDIA UFM, InfiniBand, and Spectrum-X fabrics
Exposure to hybrid cloud or GPU cloud environments
Experience with GPU observability and performance profiling tools
Responsibilities
Code Upgrades
Perform cluster-level code upgrades in alignment with approved versions and compatibility standards
iDRAC Management
Configure, validate, and monitor iDRAC access and system health
Provide troubleshooting and lifecycle management support
Firmware Management
Update BIOS, NIC, storage, and other server firmware components
Validate firmware compatibility and post-upgrade system health
Redfish Automation
Use Redfish APIs for system management and monitoring
Develop automation and customization using Redfish APIs
BlueField Administration
Configure and manage NVIDIA BlueField DPUs