We are looking for a Chief HPC Network Engineer to define the global technical strategy, reference architecture, and engineering vision behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.
The role focuses on establishing the long-term technical direction, governing architecture decisions across multiple programs, and setting organization-wide engineering standards for high-performance network fabrics supporting massive-scale LLM and distributed AI workloads. The scope covers InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a principal technical authority, you will shape engineering culture, mentor lead and principal engineers, influence executive client roadmaps, and own end-to-end governance of mission-critical network platforms across the portfolio.
The ideal candidate combines authoritative expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading multiple engineering teams, defining technical strategy at the program level, and shaping industry-leading HPC/AI network platforms.
Responsibilities
- Define and own the multi-year strategic vision and architectural roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics powering massive-scale GPU clusters and distributed AI/LLM workloads across the client portfolio
- Govern the design, evaluation, and standardization of cluster network topologies, including fat-tree, Clos, rail-optimized, and Dragonfly, and establish enterprise-wide decision frameworks aligned with workload scale, performance, and cost constraints
- Establish and enforce organization-wide engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
- Set the strategic direction for performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training workloads, and oversee resolution of the most complex systemic performance issues
- Define the canonical reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and drive its adoption across programs
- Own the strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align adoption with the broader infrastructure roadmap
- Define the enterprise observability strategy for network platforms, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
- Provide technical leadership and mentorship to lead and principal engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, building the talent pipeline and driving cross-functional alignment at scale
- Act as the principal technical authority in executive client and stakeholder forums, shaping strategic technical direction, negotiating trade-offs at the program level, and ensuring delivery of reliable, scalable network platforms across multiple engagements
- Contribute to the broader engineering community through thought leadership, internal practice development, and representation of the company at industry events
Requirements
- 8+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, including 4+ years focused on HPC, AI/ML, or GPU cluster networking and 2+ years of demonstrated technical leadership at the program or portfolio level
- Proven experience defining the architecture and governing delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-critical distributed compute environments
- Authoritative expertise in host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with proven ability to set enterprise-wide standards and uplift engineering organizations
- Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with the ability to drive workload-network co-design strategy at scale
- Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures
- Expert-level mastery of RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale
- Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with the ability to define diagnostic methodologies for the broader engineering organization
- Demonstrated ownership of enterprise network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
- Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving consensus across researchers, platform stakeholders, and executive sponsors
- English language proficiency at an Advanced level (C1)
Nice to have
- Hands-on architectural and strategic experience with Azure Networking, Ethernet, and GPGPU/GPU technologies
- Authoritative command of Grafana, Prometheus, and network administration, with experience defining observability standards across an engineering organization
- Proven ability to define strategy for, govern, and scale Infrastructure as Code practices across multiple teams and programs
- Proficiency in Python and UNIX shell scripting for automation, tooling, and enabling organization-wide engineering productivity
- Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain