Job Summary
We are seeking a highly skilled and proactive Software Engineer (Reliability & Data Analytics) to design, develop, and maintain mission-critical systems with a strong focus on observability, system reliability, and data-driven operations. The ideal candidate will play a key role in building automated monitoring solutions, improving system uptime, and enabling predictive insights through advanced analytics on operational data.
This role requires strong engineering fundamentals, experience with distributed systems, and a passion for solving complex production issues in high-availability environments.
Key Responsibilities
Unified Observability & Predictive Analytics
- Design, develop, and deploy automated monitoring and self-healing systems to proactively detect and resolve operational issues.
- Analyse system logs, metrics, and performance data to generate actionable insights and predictive alerts.
- Build and maintain real-time observability dashboards for monitoring the health of critical infrastructure and applications.
- Collaborate with engineering and SRE teams to improve system visibility and telemetry coverage.
System Reliability & Operational Support
- Lead Root Cause Analysis (RCA) during production incidents and implement long-term code-level fixes.
- Work closely with Application Support and Site Reliability Engineering (SRE) teams to ensure high availability of 24/7 mission-critical systems.
- Identify system bottlenecks, single points of failure, and performance inefficiencies.
- Drive continuous improvement initiatives to enhance system stability, scalability, and performance.
Technical Competencies: Core Skills (Must-Have)
- Programming Languages: Strong proficiency in Python or Go (automation/operations focus), and Java or C# (.NET).
- APIs & Integration: Experience building and consuming RESTful APIs and SOAP services; familiarity with message brokers such as Kafka or RabbitMQ.
- Databases: Hands-on experience with relational databases (PostgreSQL, Oracle) and NoSQL databases.
- Cloud & DevOps: Experience with Docker, Kubernetes, and CI/CD pipelines (Jenkins, GitLab CI, or similar).
Good to Have
- Experience with observability and monitoring tools such as Splunk, Datadog, or New Relic.
- Exposure to data analytics and visualization tools such as Grafana or Tableau.
- Familiarity with data processing frameworks and logging stacks such as ELK (Elasticsearch, Logstash, Kibana) or Python-based analytics (pandas).
Qualifications
- Bachelor’s degree in Computer Science, Software Engineering, or a related field.
- 2–5 years or more of experience in software engineering, SRE, backend development, or data-driven engineering roles.
- Strong understanding of distributed systems, production support, and system design principles.
- Experience working in high-availability or mission-critical environments is a plus.
Key Competencies
- Strong analytical and problem-solving skills
- Ability to perform under pressure in production incident scenarios
- Excellent debugging and troubleshooting skills
- Strong collaboration and communication skills
- Proactive mindset with a focus on automation and prevention
Preferred Experience
- Experience in SRE, platform engineering, or reliability engineering roles
- Exposure to fintech, aviation, telecom, or other high-scale systems
- Experience with real-time streaming systems and event-driven architectures
Work Location
In-person
Pay: RM5,000.00 - RM5,500.00 per month