Full-time Artificial Intelligence Engineer Jobs in Evolution Singapore - Ricebowl

Artificial Intelligence Engineer

Evolution Singapore

Undisclosed

Singapore

Work Location

  • Singapore

Job Description

Job Responsibilities

About the Role

This role builds the foundation for production-grade distributed AI training at scale. You will design reusable training recipes, benchmarking frameworks, and evaluation standards that enable large customers to train and compare models efficiently across multi-node GPU clusters.

You’ll work closely with platform, orchestration, and application engineers to turn distributed training best practices into repeatable, customer-facing templates.


Job Details

  • Build and maintain production-ready distributed training recipes using frameworks such as TorchTitan and Megatron-LM
  • Define model scaling baselines and tuning guidance across GPU counts, parallelism strategies, and checkpointing patterns
  • Design and run multi-node communication and performance benchmarks (throughput, MFU, cost, energy efficiency)
  • Create standardized evaluation harnesses and offline benchmarking suites for model comparison
  • Publish training efficiency playbooks and benchmark results to guide internal teams and customers


Job Requirements

  • 5–7 years of hands-on experience with distributed ML training (PyTorch/JAX, FSDP, DeepSpeed, multi-node GPU systems)
  • Deep expertise in GPU performance optimization, memory behavior, and NCCL communication patterns
  • Proven ability to debug convergence issues and optimize large-scale training throughput
  • Strong benchmarking discipline with experience designing controlled, repeatable experiments
  • Practical knowledge of model parallelism trade-offs (FSDP, tensor, pipeline parallelism)

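Important Safety Guidelines

When applying for a job, never provide your bank or credit card details. Do not transfer money or complete unrelated online surveys. If you notice anything suspicious, please report this job ad.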
