We are hiring a Lead DevOps Engineer to own and evolve the AWS platform behind a custom VDI solution and cloud playtesting/streaming services. You will drive infrastructure-as-code, ECS/EKS operations, AWS Lambda automation, and GitHub Actions CI/CD standards while optimizing GPU EC2 cost/performance and leading incident response across the platform. Apply now to help keep the platform reliable, efficient, and scalable.
Responsibilities
- Design, build, and maintain AWS infrastructure with Terraform
- Manage Terraform workflows and remote state through HashiCorp Cloud Platform (HCP)
- Own the end-to-end infrastructure lifecycle, including provisioning, upgrades, decommissioning, and operational hygiene
- Operate ECS clusters to deploy and run microservices that support the platforms
- Administer EKS clusters that host and enable GitHub Actions runners, including necessary platform customizations
- Optimize and right-size GPU-enabled EC2 capacity to meet user experience goals under strict cloud cost controls
- Assess scaling behavior continuously, monitor utilization, and identify performance bottlenecks
- Implement and maintain AWS Lambda functions that automate cleanup tasks, on-demand provisioning, and operational workflows
- Standardize and improve GitHub Actions pipelines for Terraform plan/apply workflows, infrastructure releases, and container image build/publish/deploy processes
- Lead troubleshooting and service restoration for platform-wide degradations such as VDI session drops, authentication issues, and machine/storage failures
- Coordinate incident resolution across teams by driving investigation, mitigation, and follow-up actions
- Create and keep current run books, operational documentation, and onboarding materials
Requirements
- 7+ years of proven experience in DevOps or platform engineering roles
- Deep expertise in AWS infrastructure architecture, provisioning, and full lifecycle management
- Hands-on proficiency with Terraform and HashiCorp Cloud Platform (HCP)
- Solid experience operating container orchestration using ECS and EKS
- Strong knowledge of GPU-enabled EC2 right-sizing, cloud cost management, and performance tuning
- Practical competency with AWS Lambda for event-driven automation
- Demonstrated background standardizing CI/CD using GitHub Actions pipelines
- Proven track record leading reliability engineering, troubleshooting, and incident resolution
- High ownership and accountability with the ability to work independently without close supervision
- Strong troubleshooting and systems thinking, staying calm and methodical during incidents
- Clear communication skills with both technical and non-technical stakeholders
- Effective prioritization in a Kanban workflow, balancing planned work with urgent interruptions
- English proficiency at B2 (Upper-Intermediate) level or higher
Nice to have
- Familiarity with Amazon GameLift Streams
- Understanding of streaming and playtesting platform needs
- Ability to triage urgent ad-hoc requests that fall outside the standard Kanban flow
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
EPAM is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, age, sexual orientation, gender identity or expression, disability, protected veteran status, or any other characteristic protected by applicable law.