The Principal Lead (Platform Engineering) is responsible for leading the centralized platform engineering function across our enterprise ecosystem, encompassing Cloud Engineering, Site Reliability Engineering (SRE), and Quality Engineering. This role combines multi-team technical leadership, platform strategy ownership, and operational governance to deliver a secure, reliable, observable, and AI-ready engineering platform that supports our diverse product portfolios and operational outcomes.
The Principal Lead (Platform Engineering) sets the platform strategy, defines engineering standards across infrastructure, reliability, and quality, and leads the transformation of our platform discipline from a traditional engineering model to an AI-native one. The role ensures the platform is fully equipped to host and operate AI and agentic workloads safely at production scale, while owning the operating model that allows our distributed engineering and product teams to ship faster without compromising reliability, security, cost discipline, or governance.
Responsibilities
- Chapter Leadership: Lead, mentor, and grow three distinct engineering chapters: Cloud Engineering, Site Reliability Engineering (SRE), and Quality Engineering.
- AI Transformation: Lead the evolution of the platform discipline into an AI-native engineering model, establishing new standards, operating frameworks, and team capabilities across the organization.
- Governance & Standards: Own platform-wide reliability, quality, security, and FinOps standards; act as the final escalation point for production readiness and deployment decisions.
- Agentic Operating Model: Partner with the AI Center of Excellence (COE) to define and operationalize the agentic operating model (AgentOps), covering agent lifecycle, evaluation, observability, and safe production rollout.
- Shift-Left Security: Partner with Security teams to operate CNAPP and DevSecOps tooling (SAST, SCA, secret scanning, container image scanning, SBOM, policy-as-code) and embed shift-left practices across the SDLC.
- Data-Driven Optimization: Drive continuous improvement across platform performance, reliability, quality, and cloud spend using data, KRIs, and DORA-aligned metrics.
- Ecosystem Alignment: Ensure consistent platform engineering practices and standards across all corporate subsidiaries, independent business units, and vendor-delivered systems.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, Information Systems, or equivalent practical experience.
- Significant senior leadership experience directing platform, cloud, SRE, or large-scale engineering functions within complex production environments.
- Proven experience leading multiple engineering chapters simultaneously and managing chapter/team leads.
- Deep technical expertise across cloud platforms, Kubernetes, CI/CD pipelines, observability architectures, reliability engineering, and quality engineering frameworks.
- Strong understanding of the AI / agentic operating model: agent lifecycle, evaluation, observability, safety, and production rollout patterns.
- Working knowledge of modern AI development tools and methodologies (e.g., Claude Code, Codex, Spec Kit, BMAD) and their application to modernizing platform engineering practices.
- Hands-on or strong working knowledge of:
  - Cloud platforms (GCP and/or AWS), Kubernetes, enterprise networking, and IaC (Terraform).
  - Observability and SRE tooling (e.g., Datadog), SLOs, error budgets, and incident management.
  - CI/CD, source control, quality gates, and DevSecOps controls (SAST, SCA, secret scanning, container image scanning, SBOM).
- Familiarity with modern enterprise tooling ecosystems:
  - Datadog, Prometheus, Grafana, or equivalent observability platforms.
  - Terraform, Harness, feature flag, and progressive delivery tooling.
  - CNAPP / security scanning tooling, SonarQube, and policy-as-code frameworks.
- Deep experience operating cloud-native environments at high production scale.
- Strong understanding of DevSecOps, modern SDLC paradigms, DORA metrics, and progressive platform engineering practices.
- Robust understanding of resilience engineering concepts, including chaos testing, RTO/RPO targets, multi-region failover, graceful degradation, and disaster recovery planning.