Full-time Machine Learning (Ops) Engineer job, salary up to SGD 10,000, NEWBRIDGE ALLIANCE PTE. LTD., Central Region (Singapore), now hiring

Machine Learning (Ops) Engineer

NEWBRIDGE ALLIANCE PTE. LTD.

SGD 10,000 per month

Full-time

Central Region (Singapore)

Work Location

10 Anson Road, Central Region (Singapore), Singapore

Job Description

Responsibilities

Our client's ML Platform team enables 100+ ML scientists and engineers to train, deploy, and monitor models that serve 10M+ QPS across recommendation, search, ads, and GenAI products. Their platform powers e-commerce and content experiences similar to TikTok Shop, with a focus on reliability, speed, and developer velocity.

They treat ML infrastructure as a product and operate at the scale of major social-commerce platforms.

The Role

We are hiring an MLOps Engineer to build and scale the core ML platform used by all of the client's ML teams. You will own the systems for training, serving, experimentation, and monitoring. Your work directly determines how quickly the teams can ship new models to production and how reliably those models serve millions of users.

What You’ll Do

Model Serving: Build and operate low-latency, high-throughput online inference services for deep learning models and LLMs. Optimize with vLLM, Triton, TensorRT, GPU scheduling, and autoscaling
Training Infrastructure: Scale distributed training on GPU clusters using Kubernetes, Ray, DeepSpeed, or Megatron. Improve job scheduling, checkpointing, and resource utilization
ML Platform Products: Develop internal tools covering the full ML lifecycle, including feature store, model registry, experiment tracking, workflow orchestration, and CI/CD for ML
GenAI Infra: Build infrastructure for LLM fine-tuning, RAG evaluation, vector database management, and cost/latency monitoring for GenAI workloads
Data & Feature Platform: Maintain real-time and batch feature pipelines. Ensure data quality, lineage, and SLAs for Spark, Flink, and Kafka jobs
Observability: Implement monitoring, alerting, and debugging tools for model performance, data drift, training failures, and online serving
Developer Experience: Reduce friction for ML teams. Provide SDKs, CLI tools, and documentation. Run internal office hours and gather requirements
Reliability: Own SLOs for critical ML services. Lead incident response and postmortems. Drive capacity planning and cost optimization

Minimum Qualifications

Education: BS/MS in Computer Science, Engineering, or related field
Experience: Background in software engineering, DevOps, or ML engineering, including 3+ years building ML infrastructure or platform services
Programming: Strong proficiency in Python, Go, or Java. Solid understanding of software design, testing, and distributed systems
Cloud & Containers: Production experience with Kubernetes, Docker, and AWS/GCP/Azure. Familiar with Terraform or infrastructure-as-code
ML Systems: Understanding of ML workflows. Experience with at least one of the following: model serving, distributed training, feature stores, or workflow orchestrators such as Airflow/Kubeflow
Data Systems: Experience with Spark, Kafka, or similar large-scale data tools
Problem Solving: Ability to debug complex systems across ML, data, and infra layers

Preferred Qualifications

Built ML platforms supporting 50+ ML engineers or 100+ models in production
Deep expertise in GPU inference optimization: batching, quantization, CUDA, vLLM, Triton Inference Server
Experience with LLM infra: fine-tuning pipelines, vector DBs like Milvus/Weaviate, prompt/version management
Knowledge of ML framework internals: PyTorch, TensorFlow, JAX
Experience with Ray, Kubeflow, MLflow, Feast, or Tecton
Background in high-QPS online services, SRE, or performance engineering
Contributions to open-source ML infra projects

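Important Safety Guidelines

When applying for a job, never provide your bank or credit card details. Do not transfer money or complete unrelated online surveys. If you notice anything suspicious, please report this job ad.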
