300+ Site Reliability Engineering Jobs in Malaysia | Job Vacancies | June 2026 | Ricebowl

Showing 354 job results for "site reliability engineering"

Undisclosed

Singapore

  • Improve deployment safety through CI/CD workflows, release controls, rollback paths, and environment consistency
  • Drive incident response and production readiness practices including runbooks, on-call hygiene, postmortems, capacity planning, and resilience testing
  • Reduce operational toil by automating repetitive work and improving internal developer tooling ...
Posted
11 days ago
Undisclosed
WFH

Singapore

  • Engineering / Run Engineering
  • Thought Machine's mission is bold – to properly and permanently rid the world's banks of legacy technology. To achieve this, we have developed the foundations of modern banking through core and payments technology which run natively in the cloud. What we are attempting is hard and means we need great people working together to build great technology.
  • We have grown rapidly in the past few years – growing our team to more than 550 individuals across offices in London, New York, Singapore, Sydney and our newly established Engineering Hub in Lisbon. We have raised more than £500m in funding and our investors include Molten Ventures, Eurazeo, Intesa Sanpaolo, Temasek, Nyca Partners, JPMorgan Chase Strategic Investments, Standard Chartered Ventures, and more. ...
Posted
11 days ago
Undisclosed
WFH

Singapore

  • Thought Machine’s Site Reliability Engineers are the guardians of mission-critical systems for the world's most influential financial institutions. As a member of our elite, globally distributed team, you'll be entrusted with running and maintaining the robust production infrastructure that powers our customers' cutting-edge Core Banking and Payments platforms. This is an opportunity to make a tangible impact on the global financial landscape while collaborating with brilliant minds to solve complex engineering challenges.
  • The team is deeply involved in tackling the technical challenges of executing Thought Machine’s growth ambitions - expect to be working with senior stakeholders in the organisation, our customers, and working on programmes and initiatives that are critical to the success of the company.
  • Duties: ...
Posted
11 days ago
Undisclosed

KL City

  • Hands-on experience with monitoring tools and methodologies (e.g., Prometheus, Grafana).
  • Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with cross-functional teams in a fast-paced environment.
  • Technical Requirements ...
Posted
12 days ago
Undisclosed

Singapore

  • We believe that people from diverse backgrounds come together to do their best work, be their authentic selves, and build great things. We are proud to be an equal opportunity employer.
  • Support the deployment, configuration, and maintenance of various high-end GPU servers, storage servers, networking equipment, and software components in highly secure environments.
  • Perform hardware diagnostics, system functionality checks, and firmware updates as required. ...
Posted
12 days ago
Undisclosed

Singapore

  • Regular maintenance of production systems that host Vault products.
  • Contributing to the evolution of our SaaS products by building features that foster exceptional reliability and an unparalleled user experience.
  • Implementing and testing DR strategies to ensure the highest level of resilience and fault tolerance of the platform. ...
Posted
13 days ago
Undisclosed

KL City

  • Collaboration: Partner with Python development squads to ensure new features are designed with reliability in mind; conduct code reviews for reliability-critical paths; participate in Agile ceremonies.
  • Incident Management: Conduct root cause analysis for incidents and implement corrective actions to prevent recurrence; participate in on-call rotations for critical systems; maintain runbooks in version-controlled Python projects.
  • Continuous Improvement: Drive initiatives to improve system performance, reliability, and scalability through Python best practices, including profiling, benchmarking, and dependency management. ...
Posted
13 days ago
Undisclosed

KL City

  • Collaboration: Partner with Python development squads to ensure new features are designed with reliability in mind; conduct code reviews for reliability-critical paths; participate in Agile ceremonies
  • Incident Management: Conduct root cause analysis for incidents and implement corrective actions to prevent recurrence; participate in on-call rotations for critical systems; maintain runbooks in version-controlled Python projects
  • Continuous Improvement: Drive initiatives to improve system performance, reliability, and scalability through Python best practices, including profiling, benchmarking, and dependency management ...
Posted
13 days ago
Undisclosed

KL City

  • Hands-on experience with monitoring tools and methodologies (e.g., Prometheus, Grafana).
  • Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with cross-functional teams in a fast-paced environment.
  • Technical Requirements ...
Posted
13 days ago
Undisclosed

Singapore

  • You will be responsible for system monitoring using real-time monitoring tools.
  • You will extend and acknowledge completion of handover milestones to Tiers I, II to comply with contractual SLAs.
  • You will be responsible for support operations tasks to shape the product roadmap and establish strong operational readiness across teams. ...
Posted
13 days ago
Undisclosed

Singapore

Posted
13 days ago
Undisclosed

Singapore

  • Experience with Monitoring Tools: Dynatrace, Splunk
  • Working knowledge of Java (1.8+)
  • Strong expertise in SQL and database troubleshooting (query optimization, performance tuning, and data analysis for incident resolution) ...
Posted
14 days ago
Undisclosed

Singapore

  • Systems monitoring
  • Automation and Infrastructure-as-Code
  • Plan and complete systems administration tasks on Linux and Windows systems such as application tuning, configuration management, security hardening and resource management (processors, memory, storage, networking) ...
Posted
14 days ago
SGD4,500 - SGD4,800 Per Month

Singapore

  • Guarantee the solution functionality and stability according to customer requirements.
  • Ensure compliance with Thales deployment rules and apply best practices.
  • Assist the development and validation team during the project delivery. ...
Posted
15 days ago
SGD4,500 - SGD4,800 Per Month

Singapore

  • Guarantee the solution functionality and stability according to customer requirements.
  • Ensure compliance with Thales deployment rules and apply best practices.
  • Assist the development and validation team during the project delivery. ...
Posted
15 days ago

Tata Consultancy Services Limited

SGD250 - SGD5,000 Per Month

Singapore

  • As part of the Cloud Engineering Team, the SRE Engineer engages in and improves the full lifecycle of cloud platform solutions, from design through deployment, operation, and refinement, with accuracy and in compliance with organization policies and security requirements.
  • The SRE Engineer treats operations as a software problem and therefore will code to automate repetitive tasks and optimize cloud operations.
  • Support services before go-live through activities like system design consulting, developing software platforms, and launch reviews. Maintain post-live cloud operations by measuring and monitoring availability, latency, and overall system health, and by taking prompt remediation actions. ...
Posted
16 days ago
SGD8,333 - SGD8,333 Per Month

Remote

  • Troubleshoot priority incidents, facilitate blameless post-mortems, and drive permanent resolutions.
  • Analyze incident trends and usage patterns to implement proactive solutions.
  • Design and implement self-healing and resiliency patterns to improve system uptime. ...
Posted
4 days ago
Undisclosed

Singapore

  • Engineering-driven culture with strong investment in cloud infrastructure, stability, and platform scalability
  • Responsibilities:
  • Ensure system reliability, scalability, and production stability across core business services ...
Posted
5 days ago
Undisclosed

KL City

  • Implement and enforce operational best practices: observability, logging, metrics, alerting, capacity planning, failover strategies, and backups.
  • Collaborate with Engineering, Product, Compliance, and Operations teams to ensure infrastructure meets reliability, compliance, and security standards.
  • Support service scaling, database operations, cloud infrastructure (GCP preferred), networking, and microservices orchestration. ...
Posted
5 days ago
Undisclosed

KL City

  • Manage cloud infrastructure provisioning and configuration using IaC tooling (Terraform, Helm), supporting both AWS/Azure cloud deployments and on-premises customer environments.
  • Implement and maintain CI/CD pipelines for GFS solutions (Jenkins, etc.)
  • Work with Engineering teams to ensure security and compliance readiness for Managed services — including PCI DSS, ISO 27001, SOC 1/2/3, PDPA/GDPR — in close coordination with InfoSec teams. ...
Posted
5 days ago
Undisclosed

KL City

  • Manage cloud infrastructure provisioning and configuration using IaC tooling (Terraform, Helm), supporting both AWS/Azure cloud deployments and on-premises customer environments.
  • Implement and maintain CI/CD pipelines for GFS solutions (Jenkins, etc.)
  • Work with Engineering teams to ensure security and compliance readiness for Managed services — including PCI DSS, ISO 27001, SOC 1/2/3, PDPA/GDPR — in close coordination with InfoSec teams. ...
Posted
5 days ago
Undisclosed

KL City

  • Collaboration: Partner with Python development squads to ensure new features are designed with reliability in mind; conduct code reviews for reliability-critical paths; participate in Agile ceremonies.
  • Incident Management: Conduct root cause analysis for incidents and implement corrective actions to prevent recurrence; participate in on-call rotations for critical systems; maintain runbooks in version-controlled Python projects.
  • Continuous Improvement: Drive initiatives to improve system performance, reliability, and scalability through Python best practices, including profiling, benchmarking, and dependency management. ...
Posted
5 days ago
Undisclosed

Singapore

  • Support cloud security operations, including cloud security alert management and compliance auditing.
  • 3+ years of DevOps or SRE experience; experience with AIOps or observability platform development is a plus.
  • Proficient in Python; familiar with at least one of Go or Java. Full-stack capability (React/Vue frontend + backend API) is a plus. ...
Posted
6 days ago
Undisclosed

KL City

  • Collaboration: Partner with Python development squads to ensure new features are designed with reliability in mind; conduct code reviews for reliability-critical paths; participate in Agile ceremonies.
  • Incident Management: Conduct root cause analysis for incidents and implement corrective actions to prevent recurrence; participate in on-call rotations for critical systems; maintain runbooks in version-controlled Python projects.
  • Continuous Improvement: Drive initiatives to improve system performance, reliability, and scalability through Python best practices, including profiling, benchmarking, and dependency management. ...
Posted
6 days ago
Undisclosed

KL City

  • Experience with CI/CD development and deployment tools such as Maven, Jenkins, Nexus, Git, and Docker.
  • Proficiency in Linux OS
  • Proficiency in scripting and automation (e.g. Python, PowerShell, YAML) with the ability to develop tools and infrastructure as code (Preferably Ansible, Terraform, Kubernetes, OpenShift). ...
Posted
7 days ago
Undisclosed
WFH

Malaysia

  • Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
  • Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
  • Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones ...
Posted
17 days ago
Undisclosed

KL City

  • Collaboration at its Best: Work closely with product teams, stakeholders, and global support. Immerse in and contribute to a rich tapestry of insights and expertise.
  • Mentorship and Growth: Guide budding engineers and share best practices, fostering a collective ascent.
  • Tech Evaluation: Regularly scrutinize platforms and apps, suggesting improvements rooted in data and hands-on experience ...
Posted
7 days ago
Undisclosed

Singapore

  • Engineering-driven culture with strong investment in cloud infrastructure, stability, and platform scalability
  • Responsibilities:
  • Ensure system reliability, scalability, and production stability across core business services ...
Posted
8 days ago
Undisclosed

Singapore

  • Improve team processes to meet business needs efficiently
  • Review services, assess implementations, and recommend improvements
  • Develop AI-based solutions to boost reliability, efficiency and productivity ...
Posted
8 days ago