About the Role
This role builds the foundation for production-grade distributed AI training at scale. You will design reusable training recipes, benchmarking frameworks, and evaluation standards that enable large customers to train and compare models efficiently across multi-node GPU clusters.
You will work closely with platform, orchestration, and application engineers to turn distributed training best practices into repeatable, customer-facing templates.
Job Details
- Build and maintain production-ready distributed training recipes using frameworks such as TorchTitan and Megatron-LM
- Define model scaling baselines and tuning guidance across GPU counts, parallelism strategies, and checkpointing patterns
- Design and run multi-node communication and performance benchmarks (throughput, model FLOPs utilization (MFU), cost, energy efficiency)
- Create standardized evaluation harnesses and offline benchmarking suites for model comparison
- Publish training efficiency playbooks and benchmark results to guide internal teams and customers
Job Requirements
- 5–7 years of hands-on experience with distributed ML training (PyTorch/JAX, FSDP, DeepSpeed, multi-node GPU systems)
- Deep expertise in GPU performance optimization, memory behavior, and NCCL communication patterns
- Proven ability to debug convergence issues and optimize large-scale training throughput
- Strong benchmarking discipline with experience designing controlled, repeatable experiments
- Practical knowledge of parallelism trade-offs (FSDP, tensor parallelism, pipeline parallelism)