700+ Reliability Jobs - June 2026 - High Salaries

Showing 707 job results for "reliability"

Undisclosed

KL City

  • Bachelor's degree or higher in Computer Science or a related field, with a minimum of 2 years of experience in SRE or platform operations; experience in the gaming industry is a plus.
  • Proficient in Unix/Linux operating systems, with hands-on experience in Shell/Python scripting; development experience is preferred.
  • Solid experience in managing public cloud services (e.g., AWS, GCP), proficient with Kubernetes and its ecosystem, and skilled in MySQL/Redis and related technologies. ...
Posted
10 days ago
Undisclosed

Singapore

  • Design maintenance programs to maximize equipment operability and efficiency while minimizing life cycle costs.
  • Understand potential equipment failures and provide full technical support to Facility Operations teams in the event of a critical system failure.
  • Conduct root cause analysis reviews with site teams and key OEM vendors and develop a corrective action program. ...
Posted
a month ago
SGD8,000 Per Month

Singapore

  • You should have solid hands-on experience with Kubernetes and container orchestration, along with familiarity with CI/CD tooling such as GitLab, Jira, Confluence, Fortify, or similar tools in the DevSecOps space.
  • Strong proficiency in at least one scripting or programming language — Python, Go, or Bash — is expected, as is experience with infrastructure-as-code tools like Terraform or Ansible.
  • You should be comfortable working with cloud platforms (AWS, Azure, or GCP) and have experience setting up and managing observability stacks such as the ELK stack, Prometheus, Grafana, or equivalent. ...
Posted
23 days ago
Undisclosed

Singapore

  • Participate in building tools to enhance the observability and automation of data services.
  • Document standard operating procedures and support knowledge sharing across the team.
  • Currently pursuing a Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field. ...
Posted
17 days ago
Undisclosed

Singapore

  • Create smart traffic management that can handle viral video surges without breaking a sweat
  • Build tools to spot and fix issues before they reach users
  • Keep TikTok running smoothly across continents and time zones ...
Posted
17 days ago
Undisclosed
  • Data Interpretation and Reporting: Analyze raw data and generate detailed reports for engineering teams.
  • Cross-functional Collaboration: Work closely with engineers and upstream/downstream colleagues to ensure smooth operation without miscommunication.
  • Diploma in any engineering or science discipline ...
Posted
a month ago
Undisclosed

Malaysia

  • Hands-on experience (tester setup) with automated test equipment (tester) and environmental chambers focusing on Temperature Cycling Test (TCT), Temperature Humidity, Highly Accelerated Stress Test (HAST) or other relevant equipment.
  • Develop reliability test capabilities, test programs, and testing methods for new products or technologies.
  • Be involved in tester development and improvement. ...
Posted
18 days ago
SGD6,000 Per Month

Singapore

  • Collaborate closely with cross-functional engineering and infrastructure teams to ensure operational readiness and platform stability
  • Design and implement robust monitoring frameworks, intelligent alerting systems, and incident response processes to achieve operational excellence
  • Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and improve system reliability ...
Posted
a month ago
Undisclosed

Singapore

  • Automation & Tooling: Develop and implement automation tools and scripts using Python to reduce manual operational tasks and improve efficiency. This includes automating health checks, operational tasks, and contributing to CI/CD pipelines.
  • Monitoring & Observability: Implement and enhance monitoring, alerting, and logging systems to ensure comprehensive visibility into application health and performance. Define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Collaboration & Communication: Act as a primary point of contact for users, effectively communicating status updates and resolution plans during live issues. Collaborate closely with development, infrastructure, and other technology teams, as well as external vendors, to drive issue resolution and system enhancements. ...
Posted
6 days ago
Undisclosed
  • Conduct root cause analysis reviews with site teams and key OEM vendors and develop a corrective action program.
  • Provide systems reliability and maintainability feedback to the Design Engineering teams for future design considerations.
  • Work with Design Engineering and Construction teams to ensure the reliability and maintainability of new and modified installations. ...
Posted
a month ago
Undisclosed

Singapore

  • Automation & Tooling: Develop and implement automation tools and scripts using Python to reduce manual operational tasks and improve efficiency. This includes automating health checks, operational tasks, and contributing to CI/CD pipelines.
  • Monitoring & Observability: Implement and enhance monitoring, alerting, and logging systems to ensure comprehensive visibility into application health and performance. Define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
  • Collaboration & Communication: Act as a primary point of contact for users, effectively communicating status updates and resolution plans during live issues. Collaborate closely with development, infrastructure, and other technology teams, as well as external vendors, to drive issue resolution and system enhancements. ...
Posted
7 days ago
Undisclosed

Singapore

  • Embed Design-for-Reliability (DfR) principles into all mechanical design and deployments, conducting extensive Failure Mode and Effects Analysis (FMEA) during the design and delivery phases.
  • Establish predictive maintenance frameworks utilizing telemetry and Building Management System (BMS) data to preempt mechanical degradation and optimize PUE and WUE.
  • Lead the post-incident lifecycle once the root cause analysis (RCA) for any mechanical anomaly has been established, ensuring findings are fed back into future system designs. ...
Posted
a month ago
Undisclosed

Taiwan

  • Monitoring the health status of each service.
  • Collaborate with the development team and PM team.
  • Positive attitude and a strong commitment to delivering quality work. ...
Posted
a month ago

台灣積體電路製造股份有限公司

Undisclosed

Taiwan

Posted
a month ago
Undisclosed

Singapore

  • In this role, you will join our Payments Network SRE team, where you’ll be responsible for continuously assessing and improving the service quality of our F5 load balancer operations and a diverse range of networking technologies. You will provide strategic insights to key stakeholders on optimal resource utilization, capacity forecasting, and performance trends—helping to ensure availability, scalability, and resilience across our network.
  • Key Responsibilities:
  • Lead continuous assessments of our F5 and network infrastructure supporting critical Mastercard applications, focusing on health, performance, and capacity analysis. Collaborate with Product and Development teams to forecast growth requirements and ensure scalability. ...
Posted
a month ago
Undisclosed

Malaysia

  • Handle logistics shipments in and out of the laboratory to ensure on-time delivery and meet fulfilment cycle time expectations
  • Reliability engineering support responsibilities may be assigned on an as-needed basis
  • Participate in RCCA efforts and assist engineers in reliability activities ...
Posted
a month ago
Undisclosed

KL City

  • Soft Skills: Strategic thinking, exceptional communication, and the ability to collaborate effectively with cross-functional teams in a fast-paced environment.
  • Coding: Proficient in at least one high-level programming language (e.g., Python, Go, C++, or Java) and shell scripting. Strong understanding of data structures and algorithms.
  • Systems: Strong understanding of Linux operating systems and open-source technologies and a solid understanding of network architecture. ...
Posted
a month ago
Undisclosed

Singapore

  • Deliver a playbook for onboarding new tasks / activities covering both Application and Infrastructure support models
  • Identify opportunities to automate Production support activities (App & Infra) and reduce manual interventions
  • Drive application and infrastructure improvements including performance, capacity, resilience, and operational stability; eliminate toil through automation ...
Posted
16 days ago
Undisclosed

Singapore

  • Responsible for global system disaster recovery, cluster machine governance, stability of business services, resource utilisation improvement and operation efficiency improvement
  • Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently
  • Be part of the global team roster that ensures system and business on-call support ...
Posted
a month ago
Undisclosed

Singapore

  • Bachelor's degree or above in a computer-related field.
  • Experience in Big Data SRE operations or technical support for toB (business-facing) products.
  • Familiarity with one or more open-source components, such as Hadoop, Spark, Flink, Hive, Presto/Trino, Doris, Kafka, HBase, Hudi, ClickHouse, etc. ...
Posted
a month ago
Undisclosed

Malacca City

  • Implement equipment and process adaptation according to new requirements from the roadmaps. Work package member for the project team. Deliver the know-how to the Engineering sustaining team.
  • Plan and manage Technical Feasibility Checks on new reliability projects and provide feedback to clients/lab logistic planner.
  • Liaise internally with the Operation/Engineering sustaining team and prioritize achieving the agreed timeline, including milestone and deadline reliability. ...
Posted
a month ago
MYR10,000 - MYR12,500 Per Month
  • Build and maintain CI/CD pipelines using tools like Jenkins, GitHub, Bitbucket, or Bamboo
  • Implement Infrastructure as Code (Terraform, Ansible, Chef) for automation
  • Develop scripts/tools using Python, Go, or Java to improve system efficiency ...
Posted
a month ago
Undisclosed

KL City

  • Define and implement SLOs/SLIs/SLAs; create dashboards and alerting to track service health (availability, latency, errors, saturation).
  • Lead sustainable incident response: triage, mitigation, root-cause analysis (RCA), and blameless postmortems with actionable follow-ups.
  • Collaborate with software engineering, security, and compliance stakeholders to meet data governance and regulatory requirements. ...
Posted
a month ago
Undisclosed

George Town

  • Define and implement SLOs/SLIs/SLAs; create dashboards and alerting to track service health (availability, latency, errors, saturation).
  • Lead sustainable incident response: triage, mitigation, root-cause analysis (RCA), and blameless postmortems with actionable follow-ups.
  • Collaborate with software engineering, security, and compliance stakeholders to meet data governance and regulatory requirements. ...
Posted
a month ago
Undisclosed

Singapore

  • Help implement and maintain secure access controls and compliance practices
  • Contribute to automation and continuous improvement of infrastructure and operations
  • Work with engineering teams to improve system stability and operational efficiency ...
Posted
a month ago
Undisclosed
WFH

Singapore

  • Improve system observability and incident response capabilities
  • Optimize operational costs while maintaining high availability and performance
  • Architect and implement AWS infrastructure using Infrastructure as Code principles ...
Posted
a month ago