30 Raffles Place, Central Region, Singapore
Job Description
Responsibilities:
Handle production incidents and conduct post-mortem analysis to improve system stability
Design, deploy, monitor, and troubleshoot Kafka and Redis clusters in the production environment, ensuring optimal performance and reliability
Work closely with development teams to ensure seamless deployment of applications or systems
Manage and optimize cloud infrastructure (AWS, Alibaba Cloud) for performance, cost, and reliability
Develop DevOps platforms such as online load testing and change management systems
Leverage LLMs or AI frameworks (OpenAI, Dify, Agno, LangChain) to enhance automation in infrastructure operations, including intelligent alert triage, root cause analysis (RCA), and chat-based operations (ChatOps)
Continuously explore and integrate AI-driven insights into operational processes to improve reliability, reduce alert noise, and support engineering teams in making better-informed decisions
Qualifications:
5+ years of hands-on experience operating Kafka and Redis in large-scale production environments, with the ability to work with developers to optimize code
Proficient in at least one of Python, Go, or Java, as well as SQL
Hands-on experience with containerization and orchestration (Docker, Kubernetes)
Strong experience with CI/CD tools such as GitHub Actions, Ansible, and Terraform
At least 3 years of experience with the AWS cloud platform; experience with GCP, Azure, or Alibaba Cloud is a plus
Excellent problem-solving and troubleshooting skills
Strong collaborative attitude and the ability to build partnerships with other teams and business stakeholders
Practical experience building or operating AIOps systems (anomaly detection, alert correlation, automated healing, or RCA)
Familiarity with LLM-based DevOps automation (e.g., building chat-based ops assistants or AI-driven observability workflows)
Experience using or integrating tools like Dify, Agno, or LangChain into operational workflows