VIPKid is hiring a full-time Senior LLM Deployment & Inference Optimization Engineer in Singapore

Senior LLM Deployment - Inference Optimization Engineer

VIPKid

Undisclosed

Singapore

Working Location

  • Singapore

Job Description

We are looking for an experienced Senior LLM Deployment & Inference Optimization Engineer to build and operate self-hosted inference infrastructure for LLMs, multimodal models, ASR, and TTS systems in the cloud. Your mission is to deliver a stable, low-latency, and cost-efficient inference platform that powers real-time conversations and voice interactions in AI-driven English learning classrooms. This is a senior, cross-functional engineering role focused on deploying, optimizing, and operating open-source inference engines and GPU infrastructure at scale, rather than developing inference kernels from scratch.


Responsibilities

  • Design, deploy, and operate self-hosted cloud inference services for LLMs, multimodal models, ASR, and TTS systems, building highly available and elastically scalable inference infrastructure.
  • Optimize and productionize open-source inference frameworks such as vLLM, SGLang, TensorRT-LLM, Triton Inference Server, and TGI, focusing on throughput, latency, time to first token (TTFT), continuous batching, KV cache optimization, quantization, and parallelization strategies.
  • Achieve the optimal balance between user experience and infrastructure cost.
  • Manage and optimize GPU resources and infrastructure costs, including instance selection, GPU utilization improvements, scheduling and workload co-location, spot and reserved instance strategies, and cost-per-inference optimization.
  • Build reliability, observability, and performance management systems for inference services, including monitoring and alerting, load testing, capacity planning, rate limiting, graceful degradation and disaster recovery, and GPU memory management and OOM mitigation.
  • Meet strict SLAs for real-time production workloads.
  • Improve model-serving engineering capabilities, including multi-model routing, load balancing, auto-scaling, canary deployments, and rollback mechanisms, to support rapid and reliable model iteration.
  • Collaborate closely with AI researchers, backend engineers, and application teams to establish an end-to-end path from model development to production deployment.


Requirements

  • Bachelor's degree or above in Computer Science or a related field.
  • 5+ years of experience in backend engineering, infrastructure engineering, MLOps, or related domains.
  • Proven production experience with self-hosted model inference systems: you have independently deployed, or led the deployment of, LLM, multimodal, or speech models in production environments, and were responsible for real-world reliability, scalability, and cost management, not just proof-of-concept or demo deployments.
  • Strong hands-on experience with one or more of vLLM, SGLang, TensorRT-LLM, Triton Inference Server, or Hugging Face TGI, including the ability to understand their internals and perform advanced service optimization.
  • Deep understanding of inference optimization techniques, including Transformer inference mechanisms, KV cache, continuous/dynamic batching, quantization (INT8, FP8, AWQ, GPTQ, etc.), tensor parallelism (TP), pipeline parallelism (PP), and PagedAttention, with proven experience tuning and deploying these techniques in production.
  • Strong knowledge of cloud-native infrastructure and GPU environments: Docker and Kubernetes; AWS, GCP, Alibaba Cloud, or similar platforms; GPU resource scheduling and utilization optimization; and infrastructure cost optimization.
  • Solid systems engineering and reliability background: distributed systems, high-concurrency services, high-availability architectures, monitoring and observability, load testing, capacity planning, and production troubleshooting.
  • Strong data-driven mindset toward SLAs and infrastructure efficiency.

Preferred Qualifications

  • Experience optimizing real-time or streaming inference systems, including streaming generation and low TTFT workloads.
  • Experience deploying and accelerating ASR systems, TTS systems, speech models, and multimodal models.
  • Experience building or operating large-scale GPU clusters, inference scheduling platforms, or model serving platforms.
  • Familiarity with CUDA programming, GPU kernel optimization, and model compilation technologies such as TensorRT, TVM, and torch.compile.
  • Understanding of model fine-tuning, distillation, and compression techniques, with awareness of the interplay between training and inference.
  • Demonstrated success in significantly reducing LLM inference costs and in building inference infrastructure from 0 to 1.

