Senior Site Reliability Engineer
Location: Singapore
Employment Type: Full-Time
Ignite Search is a specialist recruitment company within the Technology & Renewables space, located in Singapore.
Our client is a high-growth technology company operating in the online safety and compliance space. With a strong international presence and a rapidly expanding client base, they are looking for a Senior Site Reliability Engineer to join their engineering team and help scale their platform reliably as the business grows.
Key Responsibilities:
- Own the reliability, availability, and performance of the platform and public APIs
- Design and improve scalable infrastructure on AWS and Kubernetes to support rapidly growing, uneven global traffic
- Build and maintain strong observability across logs, metrics, tracing, alerting, and service health so issues are caught early and investigated quickly
- Improve deployment safety through CI/CD workflows, release controls, rollback paths, and environment consistency
- Drive incident response and production readiness practices including runbooks, on-call hygiene, postmortems, capacity planning, and resilience testing
- Reduce operational toil by automating repetitive work and improving internal developer tooling
- Partner with engineering teams to embed reliability and operability into service design from the outset, not after something fails in production
- Strengthen platform security and infrastructure hygiene across access controls, secrets handling, and system hardening
- Continuously improve system performance, resource efficiency, and cost awareness without compromising reliability
Requirements:
- 5+ years of experience in infrastructure, platform engineering, site reliability engineering, or software engineering with meaningful production ownership
- Strong hands-on experience running production systems in AWS
- Proven experience with Kubernetes and container-based workloads
- Experience with infrastructure as code, preferably Terraform
- Experience designing and operating observability stacks using tools such as Prometheus, Alertmanager, Grafana, OpenTelemetry, or equivalent
- Strong understanding of distributed systems, failure modes, service reliability, and production debugging
- Experience building or improving CI/CD systems and release workflows in modern engineering environments
- Ability to write code and automation in one or more of Go, Python, or TypeScript
- Good judgment during incidents with a practical mindset around trade-offs, risk, and recovery
- Clear written and verbal communication skills with the ability to work effectively in a remote team
- Startup experience is a plus, particularly in environments where systems and processes are still being built
Application:
If you are interested in this position, please apply directly on the platform with your latest CV. We will review your application and get back to you promptly.