Hardware Integration & Standardized Delivery: Responsible for the rack installation and physical connection of GPU servers, liquid-cooled cabinets, and network equipment. Strictly adhere to hardware installation standards to ensure operational safety and compliance.
Hardware Diagnostics & Maintenance: Troubleshoot and maintain core server components (motherboard, CPU, GPU, memory, HDD, PSU). Monitor hardware health status to promptly identify and mitigate potential risks.
Liquid Cooling System Support: Collaborate with Facility Management engineers for liquid cooling system inspections and fault handling. Possess rapid response capabilities for emergencies such as leaks or blockages.
System-Level Troubleshooting: Conduct in-depth fault diagnosis based on Linux/Unix systems, produce professional hardware failure analysis reports, and provide improvement recommendations.
Qualifications
:Skills: Proficient in Linux/UNIX systems and Shell/Python scripting, with the ability to independently troubleshoot system-level issues
.Experience: 3+ years of experience in data center hardware installation, testing, and maintenance. Experience with high-performance GPU servers (e.g., NVIDIA A100/H100/B300) or HPC cluster maintenance is preferred
.Certifications: NVIDIA Certified Engineer (NCE) hardware certification is preferred
.Liquid Cooling: Familiarity with liquid cooling principles; hands-on experience in disassembly, leak detection, and fluid replenishment for liquid-cooled servers is a plus
.Project Background: Prior participation in the construction and delivery of large-scale AI Computing Centers (AICC) or Supercomputing Centers is preferred