QNT Partners Hiring! Full Time ML Research Platform Engineer (Distributed Training - HPC)

ML Research Platform Engineer (Distributed Training - HPC)

QNT Partners

Undisclosed

Full Time

Singapore

Job Description

Responsibilities

Location: Singapore, Hong Kong or Shanghai

About the role

We are looking for a platform engineer to build the infrastructure that powers our next-generation machine learning research. Think: large-scale experimentation, distributed training, and reproducibility.

This is not an applied ML role. You will not be fine-tuning LLMs or building agents. Instead, you will build the systems that enable researchers to train models at scale.

What you will own

Distributed training pipelines for GPU-accelerated workloads (PyTorch, JAX)
Experiment management and model versioning
Resource scheduling across on-premises HPC clusters and cloud (Slurm, Kubernetes)
Observability and debugging for complex training jobs
Data lineage and artifact tracking

Must haves (non-negotiable)

2+ years building large-scale distributed systems for research or data-intensive workloads
Strong Python skills and experience writing high-performance, maintainable code
Deep familiarity with modern ML frameworks (PyTorch, TensorFlow, or JAX)
Hands-on experience with GPU-based training across multiple nodes
Comfortable working directly with researchers or quantitative analysts

Strong pluses

Experience with Slurm, PBS, or similar HPC schedulers
Built or maintained experiment tracking and model versioning systems
Worked in quantitative finance, simulation, or other performance-sensitive domains
Experience with DeepSpeed, FSDP, or Megatron-LM
Familiarity with NCCL, Ray, or Kubeflow

Not a fit if

Your recent work is focused on LLM agents, RAG, chatbots, or inference serving
You have not worked with distributed training across multiple GPUs
You prefer building user-facing products over research infrastructure
