Location: Singapore, Hong Kong or Shanghai
About the role
We are looking for a platform engineer to build the infrastructure that powers our next-generation machine learning research. Think: large-scale experimentation, distributed training, and reproducibility.
This is not an applied ML role. You will not be fine-tuning LLMs or building agents. Instead, you will build the systems that enable researchers to train models at scale.
What you will own
- Distributed training pipelines for GPU-accelerated workloads (PyTorch, JAX)
- Experiment management and model versioning
- Resource scheduling across on-premises HPC clusters and cloud (Slurm, Kubernetes)
- Observability and debugging for complex training jobs
- Data lineage and artifact tracking
Must haves (non-negotiable)
- 2+ years building large-scale distributed systems for research or data-intensive workloads
- Strong **Python** skills and experience writing high-performance, maintainable code
- Deep familiarity with modern ML frameworks (PyTorch, TensorFlow, or JAX)
- Hands-on experience with GPU-based training across multiple nodes
- Comfortable working directly with researchers or quantitative analysts
Strong pluses
- Experience with Slurm, PBS, or similar HPC schedulers
- Experience building or maintaining experiment tracking and model versioning systems
- Background in quantitative finance, simulation, or other performance-sensitive domains
- Experience with DeepSpeed, FSDP, or Megatron-LM
- Familiarity with NCCL, Ray, or Kubeflow
Not a fit if
- Your recent work is focused on LLM agents, RAG, chatbots, or inference serving
- You have not worked with distributed training across multiple GPUs
- You prefer building user-facing products over research infrastructure