We are seeking a Lead DevOps Engineer with expertise in incident and request management, and hands-on experience with monitoring and observability tools such as Dynatrace, Grafana, and Splunk.
This role focuses on monitoring setup, tool administration, and resolving medium-complexity tickets to ensure robust support and operational excellence across the organization.
Responsibilities
- Develop and maintain documentation outlining best practices for logging and monitoring within the company
- Conduct regular audits to verify logging and monitoring practices align with company policies and industry standards
- Participate in cross-functional discussions and initiatives to promote logging and monitoring best practices throughout the organization
- Manage monitoring, alerting, operability, and observability for applications using tools like Dynatrace, Splunk, and Grafana
- Triage incoming tickets, update ticket details, and assess urgency for appropriate response
- Review documentation to escalate tickets that require troubleshooting beyond Level 2 capabilities
- Provide warm handoff notes for tickets escalated to higher support levels
- Create and leverage documentation for handling standard incidents and requests
- Define average completion time per ticket and establish Service Level Objectives (SLOs) for each product request type
- Review and present metrics and escalated tickets regularly to document and improve the support process
- Manage incidents and requests related to monitoring setup and tool administration, utilizing JIRA for ticket tracking
- Be available for monitoring and escalation during off-hours and weekends, and carry pager duty for emergency situations
Requirements
- Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or equivalent experience
- Minimum 5 years of relevant professional experience in DevOps or related fields
- At least one year of experience in people management or leading a team of 5 or more members
- Strong understanding of observability, including monitoring, logging, and tracing practices
- Hands-on experience with Dynatrace, Splunk, Grafana, and other monitoring and logging tools for application and infrastructure management
- Experience with Azure logging and monitoring tools such as Log Analytics, Azure Monitor, and App Insights
- Proven ability to operate high-availability, fault-tolerant, scalable, distributed software in production environments
- Excellent oral and written communication skills in English at B2+ level or higher
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn