ML Production Engineer — Origination Decisions

Aviva Financiera, Mexico

Job details

Full-time

Benefits

Stock options
Additional vacation or paid leave

Qualifications

TensorFlow
Azure
Rust
Law
Software engineering
React
Kubernetes
Multilingual
Tooling
DevOps
C#
NumPy
English
.NET
Pandas
Docker
TypeScript
Python

Full job description

DESCRIPTION

About the team

The Origination Decisions team builds and operates the machine-learning-powered system that decides whether to approve loan applications and under which conditions. The team is small (4 people) and every member owns a vertical slice of the product end-to-end — from data pipelines through model training to production deployment — for a subset of lending products. You will therefore not only lead improvements in your area of expertise, but also regularly use the full stack as an end-user, giving you first-hand insight into what works and what doesn't.

The role

You will own the production lifecycle of our ML-based decision services: deploying them reliably, monitoring them continuously, and making them easy to evolve. This is not a traditional DevOps or SRE role. You need to understand how machine-learning systems fail — silently degrading predictions, distribution shifts, broken upstream schemas that subtly bias features — and design safeguards that catch these issues before they reach customers.

Key responsibilities

Deployment & release management

Design and maintain the promotion pipeline from pull request to dev, staging, and production, including the criteria and automated checks at each gate.
Manage containerized services on Kubernetes: image optimization, resource scaling, granular per-decider deployments.
Coordinate schema and API changes with the teams that maintain the upstream and downstream .NET / TypeScript services.

Testing & quality gates

Strengthen automated PR checks: decision-impact visualizations, anomaly detection on training data and backpopulated predictions, and integration of upstream/downstream service code into automated LLM-assisted reviews.
Improve the Bruno API test suites that run against the dev environment after every merge, balancing coverage with cost.
Extend the staging validation system that replays production traffic: detect divergences in computed features, approval statistics, and schema conformance between staging and production models.
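
As an illustration of the staging-versus-production comparison described above, here is a minimal sketch in Python. It assumes that each replayed request yields a record with a request id, an approval flag, and the computed features; the names Decision, compare_replay, and the feature names are hypothetical and are not taken from the team's codebase.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Decision:
    # Hypothetical record of one decision produced for a replayed request.
    request_id: str
    approved: bool
    features: dict[str, Any]


def approval_rate(decisions: list[Decision]) -> float:
    # Share of approved decisions; 0.0 for an empty list.
    if not decisions:
        return 0.0
    return sum(d.approved for d in decisions) / len(decisions)


def compare_replay(prod: list[Decision], staging: list[Decision], tolerance: float = 1e-9) -> dict:
    """Report feature, decision, and approval-rate divergences for replayed traffic."""
    staging_by_id = {d.request_id: d for d in staging}
    feature_diffs, flipped, missing = [], [], []
    for p in prod:
        s = staging_by_id.get(p.request_id)
        if s is None:
            missing.append(p.request_id)
            continue
        # Compare the union of feature names so added or dropped features are caught.
        for name in sorted(p.features.keys() | s.features.keys()):
            a, b = p.features.get(name), s.features.get(name)
            both_numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, b)
            )
            differs = abs(a - b) > tolerance if both_numeric else a != b
            if differs:
                feature_diffs.append((p.request_id, name))
        if p.approved != s.approved:
            flipped.append(p.request_id)
    return {
        "feature_diffs": feature_diffs,
        "flipped": flipped,
        "missing_in_staging": missing,
        "approval_rate_delta": approval_rate(staging) - approval_rate(prod),
    }


prod = [
    Decision("r1", True, {"dti": 0.31, "bureau_score": 702}),
    Decision("r2", False, {"dti": 0.62, "bureau_score": 540}),
]
staging = [
    Decision("r1", True, {"dti": 0.31, "bureau_score": 702}),
    Decision("r2", True, {"dti": 0.26, "bureau_score": 540}),
]
report = compare_replay(prod, staging)
```

In this toy data the staging model computes a different dti for request r2 and flips its decision, so both the feature diff and the flipped decision appear in the report.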

Monitoring & observability

Design and maintain production monitoring: dashboards, alerts, and cross-service distributed tracing of the full onboarding flow.
Define and track ML-specific health metrics (approval rates, score distributions, feature drift) alongside standard service metrics (latency, error rates, resource usage).
Build tooling that transforms the internal decision trace into human-readable explanations for operations and compliance stakeholders.
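
To make "feature drift" concrete, the sketch below computes a population stability index (PSI) between a reference sample and a current sample of one feature. PSI is only one common choice of drift metric; the posting does not say which metric the team uses, and the function name, bin edges, and sample values are assumptions made for illustration.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population stability index of `current` against `reference`.

    `edges` are shared, ascending bin edges; values outside them are ignored.
    """
    def proportions(sample):
        counts = [0] * (len(edges) - 1)
        for x in sample:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are half-open, except the last one, which includes its upper edge.
                if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("no values fall inside the bin edges")
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    p, q = proportions(reference), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


edges = [0.0, 0.5, 1.0]
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
shifted_scores = [0.6, 0.7, 0.8, 0.9, 0.6, 0.7, 0.8, 0.1]

no_drift = psi(reference_scores, reference_scores, edges)
drift = psi(reference_scores, shifted_scores, edges)
```

Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the reference. Any alerting threshold would have to be chosen per feature.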

Reliability & graceful degradation

Coordinate with upstream data providers to define fallback strategies when external data is unavailable (secondary providers, default values, deferred decisions).
Extend the input-validation framework so that non-critical schema violations fall back to safe defaults (with alerts) while critical violations block the decision, and simulate the impact of those fallbacks on decision quality.
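
The critical versus non-critical distinction above could be sketched as follows. This uses only the Python standard library rather than the team's actual Pydantic-based framework, which the posting does not describe; FieldRule, ValidationResult, validate, and the field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FieldRule:
    # Hypothetical rule: how to check one input field and what to do if it fails.
    name: str
    is_valid: Callable[[Any], bool]
    critical: bool
    default: Any = None  # used only when a non-critical field fails validation


@dataclass
class ValidationResult:
    values: dict[str, Any]
    alerts: list[str] = field(default_factory=list)
    blocked: bool = False


def validate(payload: dict[str, Any], rules: list[FieldRule]) -> ValidationResult:
    """Apply each rule: keep valid values, default non-critical failures, block on critical ones."""
    result = ValidationResult(values={})
    for rule in rules:
        value = payload.get(rule.name)
        if value is not None and rule.is_valid(value):
            result.values[rule.name] = value
        elif rule.critical:
            # A critical violation blocks the decision and raises an alert.
            result.blocked = True
            result.alerts.append(f"critical violation: {rule.name}")
        else:
            # A non-critical violation falls back to a safe default, with an alert.
            result.values[rule.name] = rule.default
            result.alerts.append(f"fallback applied: {rule.name}")
    return result


RULES = [
    FieldRule("monthly_income", lambda v: isinstance(v, (int, float)) and v >= 0, critical=True),
    FieldRule("months_at_address", lambda v: isinstance(v, int) and v >= 0, critical=False, default=0),
]

ok = validate({"monthly_income": 18000.0, "months_at_address": "n/a"}, RULES)
bad = validate({"months_at_address": 12}, RULES)
```

Here ok proceeds with a default for months_at_address plus an alert, while bad is blocked because monthly_income is missing. Simulating the impact on decision quality would then mean rerunning historical requests with the defaults substituted and comparing the outcomes.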

API design & integration

Design and implement new endpoints as the product evolves (e.g., counter-offers, intermediary onboarding steps, modified loan conditions).
Integrate new data sources into the online decision path — including features from video-call analysis and a low-latency feature store for returning customers — in coordination with the pipeline engineer.

Performance optimization

Profile and optimize inference time: replace heavy dependencies (e.g., LightGBM → ONNX), evaluate faster data-processing libraries (e.g., Polars over pandas), and offload hot paths to compiled code where justified.
Keep base Docker images lean and startup times low.

Cross-team code review

Review pull requests in adjacent repositories (primarily C# / .NET and TypeScript / React) that affect the services immediately upstream or downstream of the decision system, to catch integration issues early.

Benefits

Attractive compensation package, including stock options.
Fast-paced environment with significant growth opportunities.
15 annual vacation days + 7 annual personal days.
Option to work remotely 3-4 days per week, or fully remote (as long as you can come to CDMX about twice a year).
Flexible work schedule.

REQUIREMENTS

Required skills

Production ML experience — You have deployed ML models to production and dealt with the failure modes specific to learned systems: silent degradation, training/serving skew, selection bias, data-pipeline breakages, and schema drift.
Software engineering — Strong Python skills (you will work daily with FastAPI, Pydantic, and pytest). Comfortable reading and reviewing C# and TypeScript code.
Containerisation & orchestration — Hands-on experience with Docker and Kubernetes in a production setting (resource management, rolling deployments, health probes).
Testing philosophy — You think in terms of layered validation (unit, integration, contract, shadow-traffic comparison) and know how to balance coverage against cost and speed.
Monitoring & observability — Experience designing dashboards, alerts, and distributed traces for services where "the service returned 200 but the answer was wrong" is a real failure mode.
API design — Ability to design clear, evolvable REST APIs and negotiate schema changes across teams.
Communication — You will be the main point of contact between Data Science and the platform engineering teams. Clear, precise written and verbal communication is essential.
Fluency in both Spanish and English. Most of our meetings are in Spanish, but the code and most documentation are written in English.

Nice-to-haves

Experience with model-serving runtimes (ONNX Runtime, TensorFlow Serving, Triton) or model compilation/optimisation techniques.
Familiarity with Dagster, DVC, or similar ML pipeline / data-orchestration tools.
Familiarity with the Prometheus / Grafana observability stack.
Experience with performance profiling and optimisation in Python (Polars, NumPy, Numba, Cython, or Rust extensions).
Exposure to financial services, credit decisioning, or regulated environments where auditability and explainability matter.
Experience building or maintaining CI/CD pipelines with automated ML-specific validations (data quality checks, model performance gates, decision-impact analysis).
Knowledge of the Azure ecosystem (AKS, ACR, Azure DevOps).
Familiarity with API-testing tools such as Bruno or Postman for contract and integration testing.
Familiarity with Pants or similar build systems.
