Top-Level Architecture & Technology Selection: Gain deep insights into customer needs and collaborate with stakeholders to design holistic GPU infrastructure platform architectures, covering IDC infrastructure planning, hardware topology, high-performance networking, and technology stack selection.
Technical Leadership & Governance: Provide comprehensive technical leadership for platform design and delivery. Communicate directly with customer CTOs and core technical teams to reach strategic decisions on overall solution design and key technical challenges.
Complex Technical Problem-Solving: Lead the resolution of complex technical issues throughout the project lifecycle. Coordinate with customers, NVIDIA, and third-party partners to pinpoint and resolve technical bottlenecks and keep the system stable.
Project Execution & Technical Control: Oversee the technical aspects of project implementation, including site surveys, hardware rack-and-stack, cabling, software/hardware testing, stress testing, and performance tuning, ensuring delivery quality meets top-tier industry standards.
Knowledge Management & Enablement: Drive project retrospectives and capture the technical knowledge gained. Establish a robust knowledge base and conduct internal technical training and knowledge transfer.
Qualifications:
Education: Bachelor’s degree or higher in Computer Science, Electronic Engineering, Automation, or related fields.
Experience: 8+ years of experience in cloud platforms, AI infrastructure, or large-scale data center construction, with a proven track record of building enterprise-grade cloud and AI platforms from scratch.
Technical Expertise:
- GPU Architecture: Deep understanding of NVIDIA GPU architectures, especially the Blackwell series, and of NVLink interconnect technologies.
- HPC Networking: Proficiency in high-performance computing (HPC) networks, particularly InfiniBand.
- Distributed Systems: Solid theoretical foundation and practical knowledge of distributed systems.
- Infrastructure Knowledge: Familiarity with mainstream server/network equipment, Linux OS, virtualization, containerization, and data center infrastructure (power, cooling, layout).
Soft Skills: Strong resilience and technical leadership, with the ability to lead teams through end-to-end platform delivery.