The Operations Manager leads a team of Shift Leads and L1 Operations Engineers, owns service
management and ITSM maturity, drives vendor and customer operational relationships, and
ensures the organization operates as a predictable, scalable, and audit-ready service provider.
KEY RESPONSIBILITIES
Operations Leadership & Service Ownership
- Lead, manage, and develop a team of approximately 12 operations staff (Shift Leads and L1 Engineers).
- Own day-to-day service delivery for GPU infrastructure and supporting platforms.
- Be accountable for customer-facing SLA performance.
- Establish a culture of operational excellence, accountability, and ownership.
- Ensure operations are executed consistently across all shifts.
Incident, SLA & Service Governance
- Own incident management, major incident management, change management, and problem management processes.
- Ensure correct incident prioritization, escalation, and resolution.
- Track SLA performance and breach risk, and drive mitigation actions.
- Lead post-incident reviews and corrective action tracking.
- Ensure shift handovers, on-call coverage, and escalation models are robust.
ITSM, Automation & Operational Visibility
- Own the Jira Service Management operating model.
- Define and enforce ticket lifecycle standards and data quality.
- Drive automation of workflows, approvals, and notifications.
- Define and maintain operational dashboards for leadership visibility.
- Ensure ITSM data is accurate and reliable.
Customer, Vendor, & Compliance
- Act as the primary operational contact for customers on service-related matters.
- Drive professional, timely, and transparent customer communication.
- Own operational engagement with GPU, fabric, rack, and DC facility vendors.
- Track vendor SLA performance and lead escalations.
- Ensure operations meet ISO / SOC and internal compliance requirements.
- Own the operational risk register and mitigation plans.
DESIRED QUALIFICATIONS AND SKILLS
- Bachelor’s degree in Computer Science, Information Technology, Electrical Engineering, or a related field. Equivalent practical experience will be considered.
- 10+ years of experience in IT infrastructure or data center operations.
- 5+ years leading operations teams in 24x7 environments.
- Strong ITSM and service management background.
- Experience owning SLAs and service delivery.
- Experience with Jira Service Management or a similar ITSM platform.
- Familiarity with GPU platforms and AI / HPC environments.
- Proven experience designing, documenting, and implementing operational processes, SOPs, runbooks, and operational policies.
- Strong people leadership, communication, and stakeholder management skills.