Data Center - AI Infrastructure Engineer

Kapaciti.AI

Undisclosed

Singapore

Working Location

  • Singapore

Job Description

Job Description: Data Center & AI Infrastructure Engineer


Role Overview

The Data Center & AI Infrastructure Engineer is responsible for designing, operating, maintaining, and optimizing the critical physical and accelerated computing infrastructure of our high-availability data center facilities. This role bridges the gap between traditional critical facilities engineering (power, cooling, safety, cabling) and modern AI infrastructure engineering. You will ensure the continuous availability, safety, efficiency, and scalability of both standard facilities and enterprise AI Factories featuring high-density GPU clusters and rack-scale architectures.

As a mid-to-senior technical resource, you will act as a key subject matter expert and "sparring partner" for vendors, systems integrators, and contractors. You will verify equipment specifications, perform complex mechanical and electrical calculations, and validate advanced design plans (such as Reference Configurations) to minimize downtime, optimize total cost of ownership (TCO), and prevent costly operational failures.


Position Details

  • Position Type: Full-Time
  • Work Location: On-site / Regional AI Data Center Facility
  • Department: Critical Facilities & AI Infrastructure Operations
  • Target Compensation: $110,000 – $160,000 base salary, subject to experience and certification level. Senior solutions architects and infrastructure professionals with specialized NVIDIA credentials can command $140,000 – $190,000 in major hubs such as Northern Virginia, Phoenix, and Dallas-Fort Worth.


Core Responsibilities

1. High-Density Power & Electrical Systems Management

  • AI Power Distribution: Manage and monitor the electrical distribution path from the utility substation and generator level down to high-density racks. This includes handling the extreme loads common to AI training clusters, which range from 40 kW to more than 100 kW per rack.
  • Redundancy & Failover: Configure and manage redundant power topologies (such as concurrently maintainable and catcher designs) to maximize utilization and avoid stranded capacity. Operate automatic and static transfer switches (ATS/STS), switchgear, and both static and dynamic UPS systems.
  • Electrical Sizing & Verification: Read, analyze, and review electrical Single Line Diagrams (SLD) to identify design issues, verify breaker/protection sizing, and prevent electrical distribution failures. Perform sizing calculations for parallel UPS configurations and battery banks, including Battery Energy Storage Systems (BESS).
  • Electrical Safety & Protection: Oversee electrical bonding and clean grounding compliance (aligned with TIA-607 standards) to safeguard sensitive IT components. Manage safe proximity boundaries and shielding materials to mitigate Electromagnetic Field (EMF) interference.

2. Advanced Thermal Management & Liquid Cooling

  • Cooling Operations: Manage advanced precision cooling infrastructure, including chilled water loops, dry coolers, DX systems, and water- and air-side economizers. Implement hot and cold aisle containment strategies to optimize cooling efficiency and PUE.
  • Liquid Cooling Implementation: Deploy, oversee, and maintain advanced liquid cooling solutions essential for modern GPU-dense clusters (e.g., NVIDIA Hopper and Blackwell architectures). This includes direct-to-chip (D2C) cooling, rear door heat exchangers (RDHx), single- and two-phase immersion cooling, and spray/jet cooling.
  • Hydronic Loop Maintenance: Direct the maintenance of liquid cooling manifolds, piping infrastructure, coolant distribution units (CDUs), secondary coolant chemistry, and leak detection systems.
  • Thermal Sizing: Perform cooling capacity, delta-T, and airflow (CFM/CMH) calculations to optimize temperature and humidity setpoints based on ASHRAE guidelines. Coordinate backup water supply pipelines, storage tanks, and redundant routing to prevent thermal throttling.
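
As a rough illustration of the delta-T and airflow arithmetic named under Thermal Sizing, the sketch below estimates the airflow needed to remove a given rack heat load. It assumes sensible heat transfer only and approximate sea-level air properties (density 1.2 kg/m³, specific heat 1005 J/(kg·K)); the function names are hypothetical and not taken from any vendor tool, and real sizing would follow site conditions and ASHRAE guidance.

```python
# Minimal airflow sizing sketch (assumed constants, hypothetical function names).
AIR_DENSITY_KG_M3 = 1.2       # assumption: approximate air density at sea level
AIR_CP_J_PER_KG_K = 1005.0    # assumption: approximate specific heat of air
M3S_TO_CFM = 2118.88          # cubic metres per second to cubic feet per minute
M3S_TO_CMH = 3600.0           # cubic metres per second to cubic metres per hour


def required_airflow_m3s(heat_load_w: float, delta_t_k: float) -> float:
    """Airflow (m^3/s) needed to carry heat_load_w at a given air delta-T.

    Uses the sensible heat relation: Q = rho * cp * V * delta_T.
    """
    if heat_load_w < 0:
        raise ValueError("heat load must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("delta-T must be positive")
    return heat_load_w / (AIR_DENSITY_KG_M3 * AIR_CP_J_PER_KG_K * delta_t_k)


def airflow_report(heat_load_w: float, delta_t_k: float) -> dict:
    """Return the same airflow expressed in m^3/s, CMH, and CFM."""
    flow = required_airflow_m3s(heat_load_w, delta_t_k)
    return {
        "m3_per_s": flow,
        "cmh": flow * M3S_TO_CMH,
        "cfm": flow * M3S_TO_CFM,
    }


if __name__ == "__main__":
    # Example: a 40 kW rack with a 12 K rise between cold and hot aisle.
    for unit, value in airflow_report(40_000, 12).items():
        print(f"{unit}: {value:.1f}")
```

Doubling the allowed delta-T halves the required airflow, which is one reason containment and higher return temperatures matter at these rack densities.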

3. High-Performance AI Networking & Cabling

  • AI Interconnect Architectures: Design, deploy, and configure high-performance, low-latency network fabrics optimized for massive east-west traffic patterns in distributed AI workloads. This includes managing high-speed cabling for InfiniBand (up to 800G) and Spectrum-X Ethernet networking.
  • Cabling Topology Compliance: Implement structured cabling layouts following TIA-942 cabling topologies (including Top of Rack (ToR) and End of Row (EoR) designs) with a strong emphasis on rail-optimized network topologies for multi-node GPU clusters.
  • Cabling Quality & Compliance: Enforce best practices for cable management, including routing, bend radius, containment fill ratios, fiber link loss budgets, and telecommunications labeling (aligned with TIA-606). Manage physical path separation requirements, such as maintaining 20-meter separation for redundant access provider pits.

4. AI System Bring-up, Deployment, & Cluster Operations

  • Server & GPU Bring-up: Perform initial physical installation, validation, and configuration of advanced accelerated computing systems, including GPU-based servers, NVIDIA DGX, HGX, MGX, and OVX platforms.
  • System Initialization: Configure and manage Baseboard Management Controllers (BMC), out-of-band (OOB) networks, Unified Extensible Firmware Interface (UEFI), and Trusted Platform Modules (TPM 2.0). Perform firmware upgrades across components (GPUs, CPUs, DPUs, switches).
  • Control Plane Configuration: Install and verify cluster management systems, including Base Command Manager (BCM), operating systems, and cluster scheduling and container tooling (such as Slurm with Enroot and Pyxis).
  • Virtualization & Containers: Install, update, and manage GPU and DOCA drivers, the NVIDIA Container Toolkit, and Docker/Kubernetes environments to support multi-tenant, virtualized, and Multi-Instance GPU (MIG) partitioned operations.

5. Performance Verification, Diagnostics, & Maintenance

  • Cluster Benchmarking: Execute single-node and cluster-wide performance validation and stress tests, including High-Performance Linpack (HPL) burn-in, NeMo burn-in, and NVIDIA Collective Communications Library (NCCL) diagnostics to verify NVLink Switch and east-west (E/W) fabric bandwidth.
  • Hardware Troubleshooting: Diagnose, isolate, and resolve hardware faults (GPU, CPU, memory, power supplies, fans, transceivers) in production systems. Leverage real-time telemetry tools, such as Data Center GPU Manager (DCGM) and What Just Happened (WJH) services.
  • Operations & Lifecycle: Collaborate with cross-functional IT and facilities teams to manage the entire infrastructure lifecycle, from early-stage conceptual sizing and Factory Acceptance Testing (FAT) to decommissioning and retirement planning.
  • Data Center Efficiency (PUE): Actively track, measure, and optimize facility efficiency metrics, including Power Usage Effectiveness (PUE Category 1 to 3), Water Usage Effectiveness (WUE), Carbon Usage Effectiveness (CUE), and Renewable Energy Factor (REF). Maintain strict environmental standards to control dust and corrosion (ISO 14644 and ISA-71).
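
For orientation, the efficiency metrics listed above reduce to simple ratios over a reporting period. The sketch below follows the commonly used definitions (PUE as total facility energy over IT energy, WUE and CUE as water use and carbon emissions per kWh of IT energy, REF as the renewable share of total energy). The class name, field names, and sample figures are hypothetical, and the measurement boundaries for PUE Categories 1 to 3 are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class PeriodTotals:
    """Totals for one reporting period (hypothetical structure)."""
    total_facility_kwh: float   # all energy entering the data center
    it_equipment_kwh: float     # energy delivered to IT equipment
    water_litres: float         # site water consumption
    co2_kg: float               # CO2-equivalent emissions from energy use
    renewable_kwh: float        # renewable energy consumed


def efficiency_metrics(t: PeriodTotals) -> dict:
    """Compute PUE, WUE, CUE, and REF from period totals."""
    if t.it_equipment_kwh <= 0 or t.total_facility_kwh <= 0:
        raise ValueError("energy totals must be positive")
    return {
        "PUE": t.total_facility_kwh / t.it_equipment_kwh,   # dimensionless, >= 1
        "WUE": t.water_litres / t.it_equipment_kwh,         # litres per IT kWh
        "CUE": t.co2_kg / t.it_equipment_kwh,               # kg CO2e per IT kWh
        "REF": t.renewable_kwh / t.total_facility_kwh,      # fraction from 0 to 1
    }


if __name__ == "__main__":
    # Illustrative figures only, not measurements from any real facility.
    sample = PeriodTotals(
        total_facility_kwh=12_500,
        it_equipment_kwh=10_000,
        water_litres=18_000,
        co2_kg=5_000,
        renewable_kwh=5_000,
    )
    for name, value in efficiency_metrics(sample).items():
        print(f"{name}: {value:.2f}")
```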


Required Qualifications & Experience

1. Education & Experience

  • Bachelor’s degree in Electrical Engineering, Mechanical Engineering, Computer Science, IT Systems, or a related technical/infrastructure field. Substantial military experience or high-voltage critical facility trade experience may be considered in lieu of a degree.
  • 3 to 5 years of hands-on operations or facilities engineering experience in high-availability, mission-critical environments (enterprise, colocation, hyperscale, or cloud data centers).

2. Mandatory Professional Certifications

  • Certified Data Centre Professional (CDCP) (EPI/EXIN) OR
  • Certified Data Centre Specialist (CDCS) (EPI/EXIN).
  • Note: Certifications must be valid and kept current via the official 3-year recertification program.

3. AI & Accelerated Computing Core Knowledge

  • System Architecture: Solid understanding of high-performance computing (HPC) system architectures, including CPU vs. GPU design differences, PCIe Gen5/Gen6 topologies, and CPU/GPU memory hierarchies.
  • AI Workload Fundamentals: Conceptual understanding of modern AI systems, distinguishing between the infrastructure requirements of training vs. inference workloads.


Preferred Qualifications

1. Rack-Scale & AI System Expertise

  • Direct hands-on experience deploying, managing, or maintaining rack-scale systems (such as NVIDIA DGX SuperPOD architectures) or modular, prefabricated data center solutions.
  • Familiarity with the installation, configuration, and monitoring of NVIDIA BlueField Data Processing Units (DPUs) and SuperNICs.

2. Advanced Facilities & AI Certifications

  • NVIDIA-Certified Associate: AI Infrastructure and Operations (NCA-AIIO) or NVIDIA-Certified Professional: AI Infrastructure (NCP-AII) certification.
  • NVIDIA-Certified Professional: AI Networking (NCP-AIN) or NVIDIA-Certified Professional: AI Operations (NCP-AIO) certification.
  • EPI Certified Data Centre Expert (CDCE) or Certified TIA-942 Design Consultant (CTDC) credential.

3. Specialized Technical Skills

  • Direct experience operating, commissioning, or designing a Tier III/IV (Rated-3/4) concurrently maintainable or fault-tolerant facility.
  • Proficiency with scripting and automated configuration (such as Ansible playbooks or Bash) for network or system setup.
  • Practical knowledge of telemetry-based monitoring platforms, DCIM software, and digital twins for thermal/power simulation.
  • Valid OSHA 30 card, Journeyman Electrician license, or EPA 608 refrigeration certification.

