Our client's ML Platform team enables 100+ ML scientists and engineers to train, deploy, and monitor models that serve 10M+ QPS across recommendation, search, ads, and GenAI products. Their platform powers e-commerce and content experiences similar to TikTok Shop, with a focus on reliability, speed, and developer velocity.
They treat ML infrastructure as a product and operate at the scale of major social-commerce platforms.
The Role
We are hiring an MLOps Engineer to build and scale the core ML platform used by all ML teams. You will own systems for training, serving, experimentation, and monitoring. Your work directly impacts how fast ML teams can ship new models to production and how reliably those models serve millions of users.
What You’ll Do
- Model Serving: Build and operate low-latency, high-throughput online inference services for deep learning models and LLMs. Optimize with vLLM, Triton, TensorRT, GPU scheduling, and autoscaling
- Training Infrastructure: Scale distributed training on GPU clusters using Kubernetes, Ray, DeepSpeed, or Megatron. Improve job scheduling, checkpointing, and resource utilization
- ML Platform Products: Develop internal tools for the full ML lifecycle: feature store, model registry, experiment tracking, workflow orchestration, and CI/CD for ML
- GenAI Infra: Build infrastructure for LLM fine-tuning, RAG evaluation, vector database management, and cost/latency monitoring for GenAI workloads
- Data & Feature Platform: Maintain real-time and batch feature pipelines. Ensure data quality, lineage, and SLAs for Spark, Flink, and Kafka jobs
- Observability: Implement monitoring, alerting, and debugging tools for model performance, data drift, training failures, and online serving
- Developer Experience: Reduce friction for ML teams. Provide SDKs, CLI tools, and documentation. Run internal office hours and gather requirements
- Reliability: Own SLOs for critical ML services. Lead incident response and postmortems. Drive capacity planning and cost optimization
Minimum Qualifications
- Education: BS/MS in Computer Science, Engineering, or related field
- Experience: Background in software engineering, DevOps, or ML engineering, including 3+ years building ML infrastructure or platform services
- Programming: Strong proficiency in Python, Go, or Java. Solid understanding of software design, testing, and distributed systems
- Cloud & Containers: Production experience with Kubernetes, Docker, and AWS/GCP/Azure. Familiarity with Terraform or other infrastructure-as-code tools
- ML Systems: Understanding of ML workflows. Experience with at least one of: model serving, distributed training, feature stores, or workflow orchestrators like Airflow/Kubeflow
- Data Systems: Experience with Spark, Kafka, or similar large-scale data tools
- Problem Solving: Ability to debug complex systems across ML, data, and infra layers
Preferred Qualifications
- Built ML platforms supporting 50+ ML engineers or 100+ models in production
- Deep expertise in GPU inference optimization: batching, quantization, CUDA, vLLM, Triton Inference Server
- Experience with LLM infra: fine-tuning pipelines, vector DBs like Milvus/Weaviate, prompt/version management
- Knowledge of ML framework internals: PyTorch, TensorFlow, JAX
- Experience with Ray, Kubeflow, MLflow, Feast, or Tecton
- Background in high-QPS online services, SRE, or performance engineering
- Contributions to open-source ML infra projects