Full-Time Artificial Intelligence Engineer Jobs at Evolution Singapore

Artificial Intelligence Engineer

Evolution Singapore

Undisclosed

Full-time

Singapore

Work Location

Singapore

Job Description

Job Responsibilities

About the Role

This role builds the foundation for production-grade distributed AI training at scale. You will design reusable training recipes, benchmarking frameworks, and evaluation standards that enable large customers to train and compare models efficiently across multi-node GPU clusters.

You’ll work closely with platform, orchestration, and application engineers to turn distributed training best practices into repeatable, customer-facing templates.

Job Details

Build and maintain production-ready distributed training recipes using frameworks such as TorchTitan and Megatron-LM
Define model scaling baselines and tuning guidance across GPU counts, parallelism strategies, and checkpointing patterns
Design and run multi-node communication and performance benchmarks (throughput, MFU, cost, energy efficiency)
Create standardized evaluation harnesses and offline benchmarking suites for model comparison
Publish training efficiency playbooks and benchmark results to guide internal teams and customers

Job Requirements

5–7 years of hands-on experience with distributed ML training (PyTorch/JAX, FSDP, DeepSpeed, multi-node GPU systems)
Deep expertise in GPU performance optimization, memory behavior, and NCCL communication patterns
Proven ability to debug convergence issues and optimize large-scale training throughput
Strong benchmarking discipline with experience designing controlled, repeatable experiments
Practical knowledge of model parallelism trade-offs (FSDP, tensor, pipeline parallelism)

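Important Safety Guidelines

When applying for a job, never provide your bank or credit card details. Do not transfer money or complete unrelated online surveys. If you notice anything suspicious, please report this job ad.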
