Chief HPC Network Engineer - AI Infrastructure

EPAM Systems, Inc.
Remote

Job details

Qualifications

Azure
People management
Law
Kubernetes
Tooling
Firmware
UNIX
English
Ethernet
Linux
Communication skills
Python
Shell Scripting

Full job description

We are looking for a Chief HPC Network Engineer to define the global technical strategy, reference architecture, and engineering vision behind advanced AI, research, and Kubernetes-based GPU infrastructure for a major global technology client.

The role focuses on establishing the long-term technical direction, governing architecture decisions across multiple programs, and setting organization-wide engineering standards for high-performance network fabrics supporting massive-scale LLM and distributed AI workloads, including InfiniBand/RDMA, high-speed Ethernet, Kubernetes networking, host-side GPU networking, SmartNIC/DPU technologies, and deep network observability. As a principal technical authority, you will shape engineering culture, mentor lead and principal engineers, influence executive client roadmaps, and own end-to-end governance of mission-critical network platforms across the portfolio.

The ideal candidate combines authoritative expertise across InfiniBand NDR/HDR and next-generation fabrics, RDMA/RoCE, NVIDIA/Mellanox networking, NCCL/MSCCL communication patterns, Linux host networking, PCIe/GPU/NIC topology, and Kubernetes networking for GPU clusters, with a proven track record of leading multiple engineering teams, defining technical strategy at the program level, and shaping industry-leading HPC/AI network platforms.

Responsibilities

Define and own the multi-year strategic vision and architectural roadmap for high-performance InfiniBand/RDMA and Ethernet fabrics powering massive-scale GPU clusters and distributed AI/LLM workloads across the client portfolio
Govern the design, evaluation, and standardization of cluster network topologies, including fat-tree, Clos, rail-optimized, and Dragonfly, and establish enterprise-wide decision frameworks aligned with workload scale, performance, and cost constraints
Establish and enforce organization-wide engineering standards and best practices for host-side networking, including NIC configuration, drivers, firmware, IRQ affinity, NUMA placement, PCIe topology, and GPU-to-NIC communication paths
Set the strategic direction for performance engineering across RDMA/RoCE, NCCL/MSCCL, and collective communication for multi-node GPU training workloads, and oversee resolution of the most complex systemic performance issues
Define the canonical reference architecture for Kubernetes networking on GPU clusters, including CNI plugins, network policies, multi-NIC pods, RDMA/GPU device plugins, and workload orchestration integration, and drive its adoption across programs
Own the strategy and governance for SmartNIC/DPU technologies such as NVIDIA BlueField, including SR-IOV, offload, isolation, and security use cases, and align adoption with the broader infrastructure roadmap
Define the enterprise observability strategy for network platforms, governing metrics, dashboards, alerts, congestion detection, latency tracing, SLO frameworks, and capacity/performance analysis methodologies
Provide technical leadership and mentorship to lead and principal engineers across network, Kubernetes, storage, GPU infrastructure, observability, and AI research teams, building the talent pipeline and driving cross-functional alignment at scale
Act as the principal technical authority in executive client and stakeholder forums, shaping strategic technical direction, negotiating trade-offs at the program level, and ensuring delivery of reliable, scalable network platforms across multiple engagements
Contribute to the broader engineering community through thought leadership, internal practice development, and representation of the company at industry events

Requirements

8+ years of experience in network, infrastructure, HPC, SRE, or similar engineering roles, including 4+ years focused on HPC, AI/ML, or GPU cluster networking and 2+ years of demonstrated technical leadership at the program or portfolio level
Proven experience defining the architecture and governing delivery of InfiniBand/RDMA fabrics, high-speed Ethernet, and Linux networking in large-scale, performance-critical distributed compute environments
Authoritative expertise in host-side networking, including NICs, drivers, and firmware, along with PCIe topology, NUMA awareness, and GPU-to-NIC affinity, with proven ability to set enterprise-wide standards and uplift engineering organizations
Deep understanding of distributed AI training communication patterns, including NCCL-based workloads and collective operations such as all-reduce and all-gather, with the ability to drive workload-network co-design strategy at scale
Authoritative knowledge of Kubernetes and container networking for GPU or distributed workloads, including CNI concepts, network policies, multi-NIC patterns, and RDMA/GPU device integration, with experience defining reference architectures
Expert-level mastery of RDMA networking concepts, including InfiniBand, RoCE/RoCEv2, GPUDirect-related patterns, congestion behavior, and performance tuning at very large scale
Mastery of Linux networking and host-side troubleshooting, including IRQ affinity, MTU, offloads, and performance diagnostics, with the ability to define diagnostic methodologies for the broader engineering organization
Demonstrated ownership of enterprise network observability and performance management strategy, including telemetry, traffic monitoring, congestion detection, latency analysis, SLOs, capacity planning, and alerting/troubleshooting across L1-L4, fabric, and RDMA layers
Outstanding leadership, mentoring, stakeholder management, and executive communication skills, with proven experience leading multiple engineering teams, influencing C-level client architecture decisions, and driving consensus across researchers, platform stakeholders, and executive sponsors
English language proficiency at an Advanced level (C1)

Nice to have

Hands-on architectural and strategic experience with Azure Networking, Ethernet, and GPGPU/GPU technologies
Authoritative command of Grafana, Prometheus, and network administration, with experience defining observability standards across an engineering organization
Proven ability to define the strategy for, govern, and scale Infrastructure as Code practices across multiple teams and programs
Proficiency in Python and UNIX shell scripting for automation, tooling, and enabling organization-wide engineering productivity
Track record of thought leadership through conference talks, publications, patents, or open-source contributions in the HPC/AI networking domain
