Full-time Machine Learning (Ops) Engineer job, salary up to SGD 10,000, NEWBRIDGE ALLIANCE PTE. LTD., Central Region (Singapore), now hiring - Ricebowl

Machine Learning (Ops) Engineer

NEWBRIDGE ALLIANCE PTE. LTD.

SGD 10,000 per month

Central Region (Singapore)

Work Location

  • 10 ANSON ROAD, Central Region (Singapore), Singapore

Job Description

Responsibilities

Our client's ML Platform team enables 100+ ML scientists and engineers to train, deploy, and monitor models that serve 10M+ QPS across recommendation, search, ads, and GenAI products. The platform powers e-commerce and content experiences similar to TikTok Shop, with a focus on reliability, speed, and developer velocity.

They treat ML infrastructure as a product and operate at the scale of major social-commerce platforms.

The Role

We are hiring an MLOps Engineer to build and scale the core ML platform used by all of the client's ML teams. You will own systems for training, serving, experimentation, and monitoring. Your work directly impacts how fast those teams can ship new models to production and how reliably they serve millions of users.

What You’ll Do

  • Model Serving: Build and operate low-latency, high-throughput online inference services for deep learning models and LLMs. Optimize with vLLM, Triton, TensorRT, GPU scheduling, and autoscaling
  • Training Infrastructure: Scale distributed training on GPU clusters using Kubernetes, Ray, DeepSpeed, or Megatron. Improve job scheduling, checkpointing, and resource utilization
  • ML Platform Products: Develop internal tools for the full ML lifecycle: feature store, model registry, experiment tracking, workflow orchestration, and CI/CD for ML
  • GenAI Infra: Build infrastructure for LLM fine-tuning, RAG evaluation, vector database management, and cost/latency monitoring for GenAI workloads
  • Data & Feature Platform: Maintain real-time and batch feature pipelines. Ensure data quality, lineage, and SLAs for Spark, Flink, and Kafka jobs
  • Observability: Implement monitoring, alerting, and debugging tools for model performance, data drift, training failures, and online serving
  • Developer Experience: Reduce friction for ML teams. Provide SDKs, CLI tools, and documentation. Run internal office hours and gather requirements
  • Reliability: Own SLOs for critical ML services. Lead incident response and postmortems. Drive capacity planning and cost optimization

Minimum Qualifications

  • Education: BS/MS in Computer Science, Engineering, or related field
  • Experience: Background in software engineering, DevOps, or ML engineering, including 3+ years building ML infrastructure or platform services
  • Programming: Strong proficiency in Python, Go, or Java. Solid understanding of software design, testing, and distributed systems
  • Cloud & Containers: Production experience with Kubernetes, Docker, and AWS/GCP/Azure. Familiar with Terraform or infrastructure-as-code
  • ML Systems: Understanding of ML workflows. Experience with at least one of the following: model serving, distributed training, feature stores, or workflow orchestrators like Airflow/Kubeflow
  • Data Systems: Experience with Spark, Kafka, or similar large-scale data tools
  • Problem Solving: Ability to debug complex systems across ML, data, and infra layers

Preferred Qualifications

  • Built ML platforms supporting 50+ ML engineers or 100+ models in production
  • Deep expertise in GPU inference optimization: batching, quantization, CUDA, vLLM, Triton Inference Server
  • Experience with LLM infra: fine-tuning pipelines, vector DBs like Milvus/Weaviate, prompt/version management
  • Knowledge of ML framework internals: PyTorch, TensorFlow, JAX
  • Experience with Ray, Kubeflow, MLflow, Feast, or Tecton
  • Background in high-QPS online services, SRE, or performance engineering
  • Contributions to open-source ML infra projects

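Important Safety Guidelines

When applying for a job, never provide your bank or credit card details. Do not transfer money or complete unrelated online surveys. If you notice anything suspicious, please report this job advertisement.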