We are looking for a skilled Site Reliability Engineering (SRE) Lead Engineer with a minimum of 8 years of experience to join a dynamic team within a leading organization. The candidate must have deep expertise in Application Performance Monitoring (APM), Infrastructure as Code (IaC), automation, and distributed tracing using OpenTelemetry.
As an SRE lead, you will guide the design, implementation, and continuous improvement of observability solutions, ensuring system reliability, performance, and scalability while fostering best practices in SRE and DevOps.
- Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements.
- Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices.
- Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments.
- Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency.
- Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies.
- Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements.
- Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices.
- Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships.
- Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence.
Technical Skills Required:
- 8+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities.
- Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation.
- Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace.
- Strong proficiency in Infrastructure as Code (IaC) using Terraform.
- Solid understanding of cloud platforms including AWS, GCP, or Azure.
- Experience with automation/configuration management tools like Ansible, Chef, or Puppet.
- Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps.
- Experience managing Kubernetes and containerized environments (Docker, Helm).
- Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk.
- Excellent leadership, communication, and collaboration skills.
- Remote work in Guadalajara, Jalisco
- Work hours: Monday to Friday, 09:00 – 18:00
- Advanced English skills are mandatory
- Attractive Salary + Premium Benefits
- Performance bonuses, grocery coupons, and a savings fund
- Aguinaldo (year-end bonus), vacation premium, and paid vacations
- Major medical insurance (SGMM), including family coverage, and life insurance
Candidates must state their compensation expectations in their application and submit their resume in English.