A Site Reliability Engineer (SRE) combines software engineering and systems engineering to build, maintain, and optimize large-scale, highly available, and fault-tolerant systems. This role focuses on automation, infrastructure reliability, system scalability, and operational excellence in a fast-paced BPO environment supporting infrastructure services and AML systems.
Key Responsibilities
Design, build, and maintain scalable, highly available, and fault-tolerant systems
Collaborate with software engineering teams to improve reliability and system performance
Develop automation procedures to reduce manual intervention and improve operational efficiency
Monitor infrastructure performance and proactively identify system bottlenecks
Implement and maintain monitoring tools, automated alerts, SLIs, SLOs, and SLAs
Participate in 24/7 on-call rotations, including scheduled shifts and holidays
Conduct root-cause analysis and lead blameless post-mortems to prevent recurring issues
Ensure systems comply with security standards and regulatory requirements
...