NEXadept Hiring! Full-Time AI Training Infrastructure Engineer in Singapore - Ricebowl

AI Training Infrastructure Engineer

NEXadept

Salary: Undisclosed

Singapore

Working Location

  • Singapore

Job Description

About the Role

Our client is building its own foundation model for video generation, built on a Diffusion Transformer (DiT) architecture and trained with Flow Matching. They are looking for a Training Infrastructure Engineer who can turn cutting-edge research code into a stable, scalable, high-throughput training system running on large-scale GPU clusters.


This role is ideal for an engineer who enjoys solving deep systems problems at the intersection of distributed training, CUDA performance, video data pipelines, model training stability, and large-scale ML infrastructure. You will work closely with researchers and platform engineers to ensure that the video generation training stack can reliably produce results at thousand-GPU scale.


Key Responsibilities

  • You will design, optimise, and maintain large-scale distributed training systems for video generation foundation models. This includes implementing and improving training strategies such as FSDP, tensor parallelism, context parallelism, and Ulysses-style sequence parallelism, with a strong focus on improving throughput, scaling efficiency, and MFU.
  • You will build and optimise PB-scale video data pipelines, including NVDEC-based video decoding, VAE latent caching, variable-resolution bucket sampling, and efficient data loading for high-throughput model training.
  • You will work on memory and performance optimisation across the training stack, including FlashAttention, FP8 mixed precision, Triton kernels, CUDA-aware profiling, activation checkpointing strategies, and communication-computation overlap.
  • You will also be responsible for training stability and reliability. This includes identifying the root causes of loss spikes, divergence, slow nodes, communication bottlenecks, checkpoint failures, and data-related instability, as well as designing mechanisms for fast checkpoint recovery and automatic exclusion of problematic nodes.


Requirements

  • The ideal candidate has strong hands-on experience with PyTorch distributed training and a solid understanding of CUDA architecture, GPU memory hierarchy, NCCL communication, and performance profiling.
  • You should have source-level familiarity with at least one major large-scale training framework, such as Megatron-LM, DeepSpeed, PyTorch FSDP, or TorchTitan, and be comfortable reading, modifying, and debugging framework internals.
  • You should have at least one year of practical experience training models on large GPU clusters (256 GPUs or more), with a proven track record of debugging distributed training failures and improving system-level training efficiency.
  • Strong candidates will be able to reason across the full training stack, from data ingestion and model parallelism to kernel-level optimisation and fault-tolerant training operations.


Preferred Qualifications

  • Experience with DiT, diffusion models, Flow Matching, or video generation models would be highly advantageous.
  • Experience processing large-scale video datasets, building video decoding pipelines, or working with VAE latent caching systems would be a strong plus.
  • Hands-on experience writing or optimising Triton, CUDA, or CUTLASS kernels would be valuable.
  • Familiarity with open-source video generation projects such as HunyuanVideo, Wan, CogVideoX, or similar systems at source-code level would also be beneficial.
