Job Description
Job Location: Singapore (Onsite)
Job Summary:
We are looking for a GPU/AI Infrastructure Engineer with 5–7 years of experience to build, optimize, and support scalable AI/ML and HPC environments. The ideal candidate will have strong expertise in GPU acceleration, containerized workloads, and MLOps pipelines, along with hands-on experience managing AI infrastructure on on-premises or cloud platforms.
Key Responsibilities
· Design, deploy, and manage GPU-enabled infrastructure for AI/ML and HPC workloads.
· Install, configure, and optimize GPU software stacks including NVIDIA AI Enterprise, CUDA, ROCm, OpenCL, and NVIDIA NIM.
· Support GPU acceleration for machine learning frameworks and scientific applications.
· Build and manage containerized environments using Docker, Kubernetes (K8s), and Singularity.
· Deploy and manage GPU workloads on Kubernetes using the NVIDIA GPU Operator and related ecosystem tools.
· Support ML frameworks such as TensorFlow, PyTorch, scikit-learn, and MXNet.
· Develop and maintain MLOps pipelines using MLflow and Kubeflow.
· Design and implement Infrastructure as Code (IaC) solutions for AI/ML pipelines.
· Automate infrastructure provisioning using Terraform, Pulumi, and CloudFormation.
· Build and maintain CI/CD pipelines for ML model deployment and infrastructure automation.
· Collaborate with data scientists and engineers to optimize model performance and resource utilization.
· Monitor GPU utilization and system performance, and troubleshoot issues across the stack.
· Ensure scalability, reliability, and security of AI infrastructure environments.
Required Skills & Qualifications
· 5–7 years of experience in AI/ML infrastructure, HPC, or DevOps engineering roles.
· Strong experience with GPU technologies and acceleration frameworks (CUDA, ROCm, OpenCL).
· Hands-on experience with the NVIDIA AI Enterprise stack and GPU ecosystem tools (e.g., NVIDIA NIM, GPU Operator).
· Proficiency in container technologies: Docker, Kubernetes, and Singularity.
· Experience working with ML frameworks: TensorFlow, PyTorch, scikit-learn, and MXNet.
· Solid understanding of MLOps tools such as MLflow and Kubeflow.
· Expertise in Infrastructure as Code (Terraform, Pulumi, CloudFormation).
· Experience building and managing CI/CD pipelines for ML or infrastructure workflows.
· Strong scripting skills (Python, Bash, or similar).
· Familiarity with Linux-based environments.