300+ Site Reliability Engineering Jobs in Malaysia | Job Vacancies | June 2026 | Ricebowl

Showing 335 job results for "site reliability engineering"

SGD7,000 - SGD9,000 Per Month

Singapore

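  • Handle major outages and routine incidents and restore service. Analyze the root cause of each incident, then improve and optimize.
  • Develop and maintain automated operations tooling to improve operational efficiency and streamline operations processes.
  • Provide 24/7 on-call technical support, plus 5x8 business-hours service. ...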
Posted
a month ago

Truewatch Technology Inc. Pte Ltd

SGD7,000 - SGD10,000 Per Month

Singapore

  • Manage and optimize cloud infrastructure on AWS, Azure, or GCP to improve resource utilization and cost efficiency
  • Implement Infrastructure as Code (IaC) and automate deployments through CI/CD pipelines to accelerate delivery and reduce errors
  • Enhance system scalability, resilience, and operational efficiency by identifying and applying improvements ...
Posted
a month ago
Undisclosed

Singapore

  • Proficient in GPU/ML principles and cloud platforms (e.g., AWS); hands-on experience with GPU hardware/drivers, CUDA, NCCL, and Mellanox network operations/optimization; data center experience preferred.
  • Familiar with cloud-native container technologies and disaster recovery solutions; practical Docker/Kubernetes operations experience required.
  • Skilled in Linux/Shell environments; proficient in ≥1 language (Go/Python/Java); adept at leveraging automation/AI-driven methods to further enhance service stability and efficiency. ...
Posted
10 days ago
SGD9,000 - SGD9,000 Per Month

Singapore

  • Implement and enhance monitoring and observability solutions (Grafana, Datadog).
  • Manage incidents and improve resilience and recovery processes
  • Collaborate with IT, DevOps, and Cybersecurity teams to ensure infrastructure compliance and security. ...
Posted
10 days ago
Undisclosed

Singapore

  • Monitor system performance, troubleshoot issues, and ensure optimal operation
  • Partner with development teams to improve system reliability and performance at the code and architecture level
  • Develop and implement automation tools to streamline operations and reduce toil ...
Posted
a month ago
MYR19,000 - MYR19,000 Per Month

KL City

  • Conduct thorough post-mortem analyses following incidents, driving continuous improvement through root cause identification and solution implementation.
  • Collaborate with development and operations teams to establish best practices in system reliability and incident management.
  • Troubleshoot and resolve issues related to database performance, network connectivity, and deployment failures, including diagnosing problems at the underlying platform level (e.g., Kubernetes, virtual machines). ...
Posted
a month ago
Undisclosed

Singapore

Posted
a month ago
Undisclosed

KL City

  • Implement monitoring, alerting, SLIs, SLOs, and SLA tracking.
  • Participate in 24/7 on-call rotations and incident response activities.
  • Conduct root cause analysis and support post-mortem reviews. ...
Posted
a month ago
Undisclosed

Singapore

  • CI/CD
  • Python
  • IaC – Terraform, Helm, Ansible, Pulumi, Bicep ...
Posted
12 days ago
Undisclosed

Singapore

  • Build and maintain production tooling that supports deployment, orchestration, monitoring, and system diagnostics
  • Define and maintain observability, SLI/SLOs, and performance metrics in partnership with product owners
  • Leverage metrics and capacity planning to ensure scalability and uptime ...
Posted
a month ago
Undisclosed

Singapore

  • Understanding of cloud computing concepts (AWS, Azure, or GCP)
  • Interest in DevOps, infrastructure, automation, and cloud technologies
  • Basic knowledge of monitoring, logging, and alerting systems ...
Posted
12 days ago
Undisclosed

Hong Kong

  • Identify opportunities to eliminate toil through automation, code improvements, and process optimizations.
  • Conduct root cause analyses for system failures and incidents, and implement engineering solutions to prevent future occurrences.
  • Lead incident management and resolution efforts, ensuring timely and effective response to incidents, and driving post-incident reviews and process improvements. ...
Posted
15 days ago
Undisclosed

KL City

  • Collaboration at its Best: Work closely with product teams, stakeholders, and global support. Immerse in and contribute to a rich tapestry of insights and expertise.
  • Mentorship and Growth: Guide budding engineers and share best practices, fostering a collective ascent.
  • Tech Evaluation: Regularly scrutinize platforms and apps, suggesting improvements rooted in data and hands-on experience ...
Posted
a month ago
Undisclosed

Singapore

  • CI/CD
  • Python
  • IaC – Terraform, Helm, Ansible, Pulumi, Bicep ...
Posted
14 days ago
MYR11,000 - MYR11,000 Per Month
  • EPF & SOCSO
  • Annual Leave
  • Medical Leave ...
Posted
2 days ago
Undisclosed

Singapore

  • Drive capacity planning, performance optimization, disaster recovery, and business continuity planning.
  • Build and manage cloud-native infrastructure across AWS, Azure, GCP or hybrid environments.
  • Implement Infrastructure-as-Code (IaC) using tools such as Terraform, Ansible or Helm. ...
Posted
20 days ago
Undisclosed

Taiwan

  • Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.
  • Deploy, manage, and optimize containerized workloads running on Kubernetes.
  • Maintain scalable cloud infrastructure across production environments. ...
Posted
17 days ago
Undisclosed

Singapore

  • CI/CD
  • Python
  • IaC – Terraform, Helm, Ansible, Pulumi, Bicep ...
Posted
17 days ago
Undisclosed

Singapore

  • Incident Response: Lead incident response for complex issues, perform postmortems, and implement preventive measures.
  • Mentorship & Collaboration: Mentor junior SREs, collaborate with DevOps and engineering teams, and promote reliability best practices.
  • Runbooks & Documentation: Maintain and enhance operational runbooks, ensuring production knowledge is standardized. ...
Posted
25 days ago
Undisclosed

Singapore

  • Build a robust incident management mechanism. Lead efforts to troubleshoot and resolve service incidents and run postmortems. Coordinate with cross-functional teams to manage and mitigate service-impacting events.
  • Develop highly efficient toolchains covering end-to-end deployment and reliability assurance operations. Automate infrastructure provisioning, scaling, and management processes to reduce manual intervention and improve service quality. Develop and enhance system capabilities such as automatic failure detection, auto-healing, and chaos engineering, and perform systematic disaster drills.
  • Engage with product and development teams to integrate reliability and performance considerations into the software lifecycle. ...
Posted
17 days ago
Undisclosed

KL City

  • Monitor, maintain, and improve the reliability, availability, and performance of production systems and services.
  • Build and maintain infrastructure as code (IaC), deployment pipelines, and automation to support continuous delivery, scalability, and disaster recovery. ...
Posted
a month ago
Undisclosed
WFH

Singapore

  • Monitor, maintain, and improve the reliability, availability, and performance of production systems and services.
  • Build and maintain infrastructure as code (IaC), deployment pipelines, and automation to support continuous delivery, scalability, and disaster recovery. ...
Posted
a month ago
Undisclosed

Singapore

  • Flat Structure: Enjoy a flat organizational structure that promotes collaboration and provides opportunities to learn from seasoned technical leadership
  • Automate Routine Tasks: Develop tools to automate administrative tasks, reducing manual intervention and improving efficiency.
  • Optimize System Performance: Create automated solutions to monitor and maintain system performance, ensuring reliability and scalability. ...
Posted
a month ago
Undisclosed
WFH

Singapore

  • If you enjoy thinking in systems, debugging complex issues, and preventing problems before they happen, this role will push you to grow fast.
  • Help define SLOs, SLIs, and error budgets for core systems
  • Set up and tune monitoring and alerting (Prometheus, Grafana, OpenTelemetry) ...
Posted
a month ago
SGD7,000 - SGD8,500 Per Month

Singapore

  • Experience with application servers, preferably IBM WebSphere / Apache Tomcat 8.5.x
  • Excellent and proven experience in Oracle SQL and PL/SQL
  • Experience with monitoring tools such as Tivoli and Splunk ...
Posted
a month ago
Undisclosed

Singapore

  • Measure and monitor availability, latency and overall service health.
  • Practice sustainable incident response and postmortems.
  • Participate in on-call rotations across continents. ...
Posted
6 days ago
Undisclosed

Singapore

  • Contribute to post-incident reviews and continuous improvement initiatives.
  • Design, implement, and maintain monitoring dashboards.
  • Improve alert quality and reduce noise through effective threshold and metric design. ...
Posted
a month ago