Our Purpose
Mastercard powers economies and empowers people in 200+ countries and territories worldwide. Together with our customers, we’re helping build a sustainable economy where everyone can prosper. We support a wide range of digital payments choices, making transactions secure, simple, smart and accessible. Our technology and innovation, partnerships and networks combine to deliver a unique set of products and services that help people, businesses and governments realize their greatest potential.
Title and Summary
Site Reliability Engineering Manager
The Xborder team is looking for a Site Reliability Engineering Manager who can help us solve problems, implement automation, and leverage best practices.
- Are you a born problem solver who loves to figure out how something works?
- Are you a detail-oriented individual who enjoys complex problem solving?
- Do you love determining the correct actions required to fix a problem?
- Do you have a low tolerance for manual work and look to automate everything you can?
Business Operations is leading the Site Reliability Engineering (SRE) transformation at Mastercard through our tooling and by being an advocate for change & standards throughout the development, quality, release, and product organizations. We need team members with an appetite for change and pushing the boundaries of what can be done with automation. Experience in working across development, operations, and product teams to prioritize needs and to build relationships is a must.
Overview
A Site Reliability Engineering (SRE) Manager is responsible for ensuring the reliability, scalability, and production readiness of technology platforms by closely partnering with engineering teams across the development lifecycle. The role focuses on embedding operational excellence—including availability, performance, monitoring, automation, and self-healing capabilities—into all solutions. As a leader of SRE transformation, the manager drives the adoption of standards, tools, and best practices across development, quality, and release functions. They oversee incident management through effective triage and root cause analysis, promote a proactive “shift-left” approach, and lead efforts in risk management, compliance, and process standardization. Ultimately, the SRE Manager aligns product and customer priorities with operational needs, continuously improving system performance and enhancing overall customer experience.
Key Responsibilities
- Lead and drive the end-to-end service lifecycle, ensuring teams effectively engage from inception and design through deployment, operations, and continuous improvement, while aligning with business objectives.
- Oversee ITSM practices across the platform, establishing governance and ensuring teams proactively identify operational gaps and resiliency risks, while driving action plans in partnership with engineering teams.
- Provide strategic direction for production readiness, guiding teams on system design consulting, capacity planning, and launch readiness reviews to ensure scalable and reliable service delivery.
- Own service reliability outcomes by defining KPIs/SLOs and leading the monitoring of availability, latency, and system health, ensuring accountability across the team.
- Drive scalability and operational efficiency, promoting automation, standardization, and continuous improvement initiatives to enhance reliability, reduce toil, and accelerate delivery velocity.
- Lead incident management excellence, establishing best practices for sustainable incident response, ensuring blameless postmortems, and driving root cause remediation and preventive actions at scale.
- Champion a holistic, cross-stack problem-solving approach, enabling teams to effectively manage complex production incidents and improve mean time to recovery (MTTR).
- Manage and develop a high-performing global team, fostering collaboration across geographies and time zones while ensuring alignment, engagement, and productivity.
- Build and nurture talent through coaching, mentoring, and career development, while promoting a strong culture of knowledge sharing and continuous learning.
All about you
- Bachelor’s degree in Computer Science, Information Technology, or a related technical field (e.g., Engineering, Physics, Mathematics), or equivalent practical experience. Experience in financial services is preferred.
- 8–15 years of relevant experience in Site Reliability Engineering, Infrastructure, or DevOps roles, with a combination of hands-on technical expertise and early leadership responsibilities.
- Strong technical foundation across enterprise platforms, Linux/UNIX systems, operating systems, and database environments (Oracle/SQL, DBA), with the ability to provide technical guidance and support to the team.
- Experience with observability and monitoring tools (e.g., Splunk, Dynatrace), driving improved system visibility, performance, and reliability.
- Solid experience in DevOps and CI/CD practices, with the ability to support and guide automation, deployment pipelines, and operational improvements.
- Proficiency in one or more programming or scripting languages such as Python, Java, Go, C/C++, Perl, or Ruby, with practical application in automation or system improvements.
- Proven exposure to automation initiatives, with the ability to contribute to and help scale solutions that reduce operational toil and improve efficiency.
- Working knowledge of ITSM processes, including incident, problem, and change management, with experience applying these practices in production environments.
- Experience supporting customer-facing platforms, ensuring service reliability, availability, and effective issue resolution.
- Strong analytical and problem-solving skills, with the ability to troubleshoot complex issues and support the team during high-severity incidents.
- Ability to prioritize, organize, and manage multiple workstreams, balancing operational needs with ongoing improvements.
- Effective communication and collaboration skills, with experience working across engineering, product, and operations teams in a global environment.
- Demonstrated experience mentoring and supporting junior engineers, contributing to team development and knowledge sharing (formal people management experience is a plus but not mandatory).
- Understanding of large-scale distributed systems, including basic design principles, performance considerations, and troubleshooting approaches.
- Exposure to Artificial Intelligence use cases and implementation is a plus, particularly in relation to automation, observability, or operational insights.
Corporate Security Responsibility
All activities involving access to Mastercard assets, information, and networks come with an inherent risk to the organization. Every person working for, or on behalf of, Mastercard is therefore responsible for information security and must:
- Abide by Mastercard’s security policies and practices;
- Ensure the confidentiality and integrity of the information being accessed;
- Report any suspected information security violation or breach; and
- Complete all periodic mandatory security trainings in accordance with Mastercard’s guidelines.