- Kuala Lumpur, Federal Territory, Malaysia
Work Location
Job Description
Job Responsibilities
At Provido Global, we’re more than a technology company. We are a global hub of innovation, creativity, and engineering excellence.
Our teams design and deliver intelligent, secure, and high-performance digital solutions that help organizations modernize operations, scale their platforms, and succeed in an increasingly digital world.
As part of a dynamic international ecosystem, we bring together forward-thinking engineers, technology specialists, designers, and delivery professionals who transform ideas into scalable, real-world solutions with measurable business impact.
If you are motivated by challenge, inspired by technology, and ready to grow with a company that truly invests in its people, your journey starts here.
What You’ll Be Doing
We are looking for a detail-oriented and experienced Site Reliability Engineer to join our team. The Site Reliability Engineer will be responsible for creating and implementing scalable solutions to meet system and application performance goals. You will also be responsible for troubleshooting system errors and resolving any relevant issues.
Implement monitoring solutions to track system health, performance, and availability. Proactively monitor systems, identify issues, and respond to incidents promptly to minimize downtime and mitigate impact
Drive continuous improvement by identifying areas for enhancement, implementing best practices, and fostering a culture of reliability engineering. Participate in post-mortems, conduct blameless retrospectives, and lead initiatives to improve system reliability, stability, and maintainability
Collaborate closely with software engineers, operations teams, and other stakeholders to ensure smooth coordination and effective communication. Share knowledge, provide technical guidance, and contribute to a strong engineering culture
Implement comprehensive service monitoring, including dashboards, metrics, and alerts
Define, measure, and meet key Service Level Objectives (SLOs), supported by Service Level Indicators (SLIs), including uptime, performance, incidents, and chronic problems
Partner with application and business stakeholders to ensure high-quality product development and release
Collaborate with the development team to enhance system reliability and performance
Automate repetitive tasks and operational processes to reduce manual toil and increase efficiency
What You Bring to the Team
Bachelor’s degree in Information Technology, Computer Science, or a related field
Must be flexible and willing to support a rotating schedule, providing 24/7 coverage as part of a shared team responsibility
Strong problem-solving abilities
Excellent understanding of computer systems, servers, and network systems
Ability to work under pressure and manage multiple tasks simultaneously
Strong communication and interpersonal skills
Basic understanding of programming concepts (structured and object-oriented) using high-level languages such as Python, Java, C#, or JavaScript
Experience with distributed storage technologies such as Amazon S3 and related services, as well as dynamic resource management frameworks (Kubernetes, YARN)
Experience with cloud computing platforms such as AWS and Azure
Experience with DevOps tools such as Git, Terraform, Docker, or related tools
Experience with monitoring tools such as Grafana, ELK Stack, Prometheus, or related tools
Preferred Skills
Experience in SRE, DevOps, or Systems Engineering roles, with strong Linux and cloud (AWS, Azure, or GCP) background
Proficiency in scripting (e.g., Python, Bash) and working with tools like Docker, Kubernetes, and Terraform
Familiarity with CI/CD pipelines and observability tools such as Grafana, Prometheus, or ELK Stack