VIPKid Hiring! Full Time Senior LLM Deployment - Inference Optimization Engineer

Senior LLM Deployment - Inference Optimization Engineer

VIPKid

Undisclosed

Full Time

Singapore

Working Location

Singapore

Job Description

We are looking for an experienced Senior LLM Deployment & Inference Optimization Engineer to build and operate self-hosted inference infrastructure for LLMs, multimodal models, ASR, and TTS systems in the cloud. Your mission is to deliver a stable, low-latency, and cost-efficient inference platform that powers real-time conversations and voice interactions in AI-driven English learning classrooms. This is a senior, cross-functional engineering role focused on deploying, optimizing, and operating open-source inference engines and GPU infrastructure at scale, rather than developing inference kernels from scratch.

Responsibilities

Design, deploy, and operate self-hosted cloud inference services for LLMs, multimodal models, ASR, and TTS systems, building highly available and elastically scalable inference infrastructure.
Optimize and productionize open-source inference frameworks such as vLLM, SGLang, TensorRT-LLM, Triton, and TGI, focusing on throughput, latency, time to first token (TTFT), continuous batching, KV cache optimization, quantization, and parallelization strategies, to achieve the optimal balance between user experience and infrastructure cost.
Manage and optimize GPU resources and infrastructure costs, including instance selection, GPU utilization improvements, scheduling and workload co-location, spot and reserved instance strategies, and cost-per-inference optimization.
Build reliability, observability, and performance management systems for inference services, including monitoring and alerting, load testing, capacity planning, rate limiting, graceful degradation and disaster recovery, and GPU memory management and OOM mitigation, so that real-time production workloads consistently meet their SLAs.
Improve model-serving engineering capabilities, including multi-model routing, load balancing, auto-scaling, canary deployments, and rollback mechanisms, to support rapid and reliable model iteration.
Collaborate closely with AI researchers, backend engineers, and application teams to establish an end-to-end path from model development to production deployment.

Requirements

Bachelor's degree or above in Computer Science or a related field.
5+ years of experience in backend engineering, infrastructure engineering, MLOps, or related domains.
Proven production experience with self-hosted model inference systems: you have independently deployed, or led the deployment of, LLM, multimodal, or speech models in production environments, and have been responsible for real-world reliability, scalability, and cost management, not just proof-of-concept or demo deployments.
Strong hands-on experience with one or more of vLLM, SGLang, TensorRT-LLM, Triton Inference Server, and Hugging Face TGI, with the ability to understand their internals and perform advanced service optimization.
Deep understanding of inference optimization techniques, including Transformer inference mechanisms, KV cache, continuous/dynamic batching, quantization (INT8, FP8, AWQ, GPTQ, etc.), tensor parallelism (TP), pipeline parallelism (PP), and PagedAttention, with proven experience tuning and deploying these techniques in production.
Strong knowledge of cloud-native infrastructure and GPU environments: Docker and Kubernetes; AWS, GCP, Alibaba Cloud, or similar platforms; GPU resource scheduling and utilization optimization; and infrastructure cost optimization.
Solid systems engineering and reliability background: distributed systems, high-concurrency services, high-availability architectures, monitoring and observability, load testing, capacity planning, and production troubleshooting.
Strong data-driven mindset toward SLAs and infrastructure efficiency.

Preferred Qualifications

Experience optimizing real-time or streaming inference systems, including streaming generation and low-TTFT workloads.
Experience deploying and accelerating ASR systems, TTS systems, speech models, and multimodal models.
Experience building or operating large-scale GPU clusters, inference scheduling platforms, and model serving platforms.
Familiarity with CUDA programming, GPU kernel optimization, and model compilation technologies such as TensorRT, TVM, and torch.compile.
Understanding of model fine-tuning, distillation, and compression techniques, with awareness of the interplay between training and inference.
Demonstrated success in significantly reducing LLM inference costs and in building inference infrastructure from the ground up (0 to 1).

