Work Location
Islandwide, Singapore

Job Description
Responsibilities
Incident & Application Support
Provide second-line (L2) support for production and staging systems, handling
escalations from L1 Support.
Investigate application errors, system alerts, performance degradation, and
integration issues.
Restore services within agreed SLA/OLA timelines and ensure proper incident
closure.
Troubleshooting & Root Cause Analysis
Perform in-depth troubleshooting using logs, metrics, and monitoring tools.
Conduct root cause analysis (RCA) for recurring or high-impact incidents.
Propose and implement corrective and preventive actions to reduce incident
recurrence.
Collaboration & Escalation
Work closely with L3 engineers, DevOps, and vendors to resolve complex technical
issues.
Provide clear technical findings, logs, and evidence when escalating issues.
Participate in incident bridges, post-incident reviews, and operational discussions.
Operational Excellence
Monitor system health, alerts, dashboards, and logs to proactively identify issues.
Execute approved configuration changes, patches, and operational fixes.
Support deployment, release, and maintenance activities when required.
Contribute to automation of operational tasks, monitoring, and alerting where
applicable.
Identify gaps in runbooks, SOPs, and operational processes and drive improvements.
Documentation
Maintain and update runbooks, troubleshooting guides, and knowledge base
articles.
Document incident resolutions and operational procedures clearly and accurately.
Security & Compliance
Adhere to security, access control, and compliance requirements.
Handle sensitive information in logs, tickets, and systems appropriately.
Support audits, vulnerability remediation, and compliance checks when required.
Key Experiences and Qualifications We Seek
Educational Background:
Diploma or higher in Computer Science, Information Technology, or a related field.
Professional Experience:
3–5+ years of relevant experience in application support, systems support, or
operations roles.
Experience supporting production systems in a high-availability or mission-critical
environment.
Technical Expertise:
Strong hands-on experience with:
Application log analysis and monitoring tools (e.g. AWS CloudWatch, Grafana, ELK, Google Analytics)
Linux/Unix environments
Working knowledge of cloud platforms (e.g. AWS services such as ECS, Lambda, S3, RDS).
Basic database knowledge (MySQL, PostgreSQL) for health checks and simple queries.
Basic knowledge of REST APIs, system integrations, and authentication design.
Understanding of incident, problem, and change management processes.
Problem-Solving Skills:
Strong analytical and troubleshooting skills.
Ability to break down complex incidents into clear, actionable steps.
Calm and methodical approach when handling production issues under pressure.
Operational Practices:
Familiarity with ticketing and incident management tools (e.g. Jira, PagerDuty).
Experience working with runbooks, SOPs, and on-call support rotations (if applicable).
Additional Skills (Bonus Points):
Experience supporting cloud-native or microservices-based systems.
Basic scripting skills (e.g. Bash, Python) for automation.
Experience working in government, regulated, or large-scale enterprise environments.
Knowledge of disaster recovery and business continuity planning.
Character Traits We Look Out For
Team player with a collaborative mindset
Strong sense of ownership and accountability for system reliability
Proactive in identifying and addressing operational issues
Willingness and ability to learn and adapt to new systems and tools
Openness to sharing knowledge and improving team capability
Clear verbal and written communication skills, including incident reporting