We are looking for an experienced Senior LLM Deployment & Inference Optimization Engineer to build and operate self-hosted inference infrastructure for LLMs, multimodal models, ASR, and TTS systems in the cloud. Your mission is to deliver a stable, low-latency, and cost-efficient inference platform that powers real-time conversations and voice interactions in AI-driven English learning classrooms. This is a senior, cross-functional engineering role focused on deploying, optimizing, and operating open-source inference engines and GPU infrastructure at scale, rather than developing inference kernels from scratch.
Responsibilities
- Design, deploy, and operate self-hosted cloud inference services for LLMs, multimodal models, ASR, and TTS systems, building highly available and elastically scalable inference infrastructure.
- Optimize and productionize open-source inference frameworks such as vLLM, SGLang, TensorRT-LLM, Triton Inference Server, and TGI, focusing on throughput, latency, time to first token (TTFT), continuous batching, KV cache optimization, quantization, and parallelization strategies. Strike the best achievable balance between user experience and infrastructure cost.
- Manage and optimize GPU resources and infrastructure costs, including instance selection, GPU utilization improvements, scheduling and workload co-location, spot and reserved instance strategies, and cost-per-inference optimization.
- Build reliability, observability, and performance management systems for inference services, including monitoring and alerting, load testing, capacity planning, rate limiting, graceful degradation and disaster recovery, and GPU memory management and OOM mitigation. Ensure that real-time production workloads consistently meet their SLAs.
- Improve model-serving engineering capabilities, including multi-model routing, load balancing, auto-scaling, canary deployments, and rollback mechanisms, to support rapid and reliable model iteration.
- Collaborate closely with AI researchers, backend engineers, and application teams to establish an end-to-end path from model development to production deployment.
Requirements
- Bachelor's degree or above in Computer Science or a related field.
- 5+ years of experience in backend engineering, infrastructure engineering, MLOps, or related domains.
- Proven production experience with self-hosted model inference systems: you have independently deployed, or led the deployment of, LLM, multimodal, or speech models in production environments, and you have been responsible for real-world reliability, scalability, and cost management, not just proof-of-concept or demo deployments.
- Strong hands-on experience with one or more of vLLM, SGLang, TensorRT-LLM, Triton Inference Server, and Hugging Face TGI, including the ability to understand their internals and perform advanced service optimization.
- Deep understanding of inference optimization techniques, including Transformer inference mechanisms, KV cache, continuous/dynamic batching, quantization (INT8, FP8, AWQ, GPTQ, etc.), tensor parallelism (TP), pipeline parallelism (PP), and PagedAttention, with proven experience tuning and deploying these techniques in production.
- Strong knowledge of cloud-native infrastructure and GPU environments: Docker and Kubernetes; AWS, GCP, Alibaba Cloud, or similar platforms; GPU resource scheduling and utilization optimization; and infrastructure cost optimization.
- Solid systems engineering and reliability background: distributed systems, high-concurrency services, high-availability architectures, monitoring and observability, load testing, capacity planning, and production troubleshooting.
- Data-driven approach to meeting SLAs and improving infrastructure efficiency.
Preferred Qualifications
- Experience optimizing real-time or streaming inference systems, including streaming generation and low-TTFT workloads.
- Experience deploying and accelerating ASR systems, TTS systems, speech models, and multimodal models.
- Experience building or operating large-scale GPU clusters, inference scheduling platforms, and model serving platforms.
- Familiarity with CUDA programming, GPU kernel optimization, and model compilation technologies such as TensorRT, TVM, and torch.compile.
- Understanding of model fine-tuning, distillation, and compression techniques, with awareness of the interplay between training and inference.
- Demonstrated success in significantly reducing LLM inference costs and in building inference infrastructure from 0 to 1.