Job Overview
We are seeking an experienced Sr. Principal Engineer, Site Reliability (SRE) to drive technical excellence within our global Site Reliability Engineering organization. This role is essential to maintaining and improving the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide. The successful candidate will provide hands-on technical expertise and strategic technical direction in incident response, system optimization, and reliability engineering practices across our complex technology stack. Off-hours support is required as needed.
About Us
iCIMS is a leading enterprise hiring platform that combines the scale and reliability of enterprise software with the transformative power of AI. Thousands of organizations across more than 200 countries and territories trust iCIMS to find and hire the people who shape their future and drive their business forward. Powered by insights from billions of hiring interactions, continuous AI innovation, and a highly extensible platform, iCIMS helps organizations turn talent acquisition into a competitive advantage. For more than 25 years, iCIMS has delivered end-to-end hiring solutions that improve recruiting efficiency, reduce costs, and create exceptional candidate experiences.
iCIMS helps solve one of the biggest challenges businesses face today: building a workforce that can adapt, scale, and perform in an increasingly competitive and unpredictable talent market. We do that by combining enterprise-grade hiring technology, AI-powered insights and automation, and connected talent experiences to help organizations improve hiring outcomes while driving measurable impact.
Responsibilities
Technical Leadership
- Provide strategic technical direction for a team of 5+ site reliability engineers across one or more geographic regions (US, Ireland, or India)
- Provide technical mentorship and guidance for team members
- Drive technical decision-making for complex reliability and performance challenges
- Conduct architecture reviews and drive system design decisions for reliability
- Lead post-incident reviews and drive implementation of preventive measures
Incident Management & Response
- Participate in enterprise-wide incident management, ensuring rapid prevention, detection, response, and resolution
- Develop and maintain runbooks and emergency response procedures
- Lead root cause analysis and ensure comprehensive documentation
- Participate in 24/7 on-call rotation and escalation procedures across global teams
- Interface with Engineering teams and Incident Manager during critical incident resolution
Platform Reliability & Performance
- Monitor and optimize multi-cloud infrastructure (AWS primary, Azure, GCP)
- Ensure reliability of core services: AWS resources, Auth0/Okta authentication, databases (SQL Server, PostgreSQL, MongoDB), and legacy Java applications
- Implement and maintain SLIs, SLOs, and error budgets for assigned services
- Drive capacity planning and performance optimization initiatives
Automation & Tooling
- Design automation solutions to reduce manual operational overhead
- Develop monitoring strategies using New Relic, Grafana, and Sumo Logic
- Create infrastructure-as-code for reliable deployments
- Build self-healing systems and automated remediation workflows
Qualifications
Technical Experience
- 8+ years in SRE, DevOps, or Infrastructure Engineering roles, including 4+ years in senior positions
- Deep hands-on experience with multi-cloud environments (AWS required, Azure preferred)
- Strong Linux system administration and troubleshooting skills
- Experience with containerization (Docker) and orchestration (Kubernetes, ECS)
- Proficiency with monitoring tools (New Relic, Grafana, Prometheus)
Leadership & Communication
- Proven track record mentoring technical teams and driving technical direction
- Experience serving as senior technical leader during critical incidents
- Strong communication skills with engineering teams and stakeholders
- Cross-functional collaboration in agile environments
SRE & Operations
- Demonstrated success implementing SRE principles in large-scale production environments
- Experience with ITIL frameworks and tools
- Background in establishing and maintaining SLAs for enterprise SaaS products
Preferred
- Authentication and identity management systems knowledge
- Infrastructure-as-code tools (Terraform, CloudFormation)
EEO Statement
iCIMS is a place where everyone belongs. We celebrate diversity and are committed to creating an inclusive environment for all employees. Our approach helps us to build a winning team that represents a variety of backgrounds, perspectives, and abilities. So, regardless of how your diversity expresses itself, you can find a home here at iCIMS. We prohibit discrimination and harassment of any kind based on race, color, religion, national origin, sex (including pregnancy), sexual orientation, gender identity, gender expression, age, veteran status, genetic information, disability, or other applicable legally protected characteristics. If you’d like to request an accommodation due to a disability, please contact us at [email protected].
Compensation and Benefits
Competitive health and wellness benefits include medical insurance (employee and dependent family members), personal accident and group term life insurance, bonding and parental leave, lifestyle spending account reimbursements, wellness services offerings, sick and casual/emergency days, paid holidays, tuition reimbursement, retirals (PF - employer contribution) and gratuity. Benefits and eligibility may vary by location, role, and tenure. Learn more here: https://careers.icims.com/benefits