DESCRIPTION
The Origination Decisions team builds and operates the machine-learning-powered system that decides whether to approve loan applications and under which conditions. The team is small (4 people) and every member owns a vertical slice of the product end-to-end — from data pipelines through model training to production deployment — for a subset of lending products. You will therefore not only lead improvements in your area of expertise, but also regularly use the full stack as an end-user, giving you first-hand insight into what works and what doesn't.
You will own the production lifecycle of our ML-based decision services: deploying them reliably, monitoring them continuously, and making them easy to evolve. This is not a traditional DevOps or SRE role. You need to understand how machine-learning systems fail — silently degrading predictions, distribution shifts, broken upstream schemas that subtly bias features — and design safeguards that catch these issues before they reach customers.
- Design and maintain the promotion pipeline from pull request to dev, staging, and production, including the criteria and automated checks at each gate.
- Manage containerized services on Kubernetes: image optimization, resource scaling, and granular per-decider deployments.
- Coordinate schema and API changes with the teams that maintain the upstream and downstream .NET / TypeScript services.
- Strengthen automated PR checks: decision-impact visualizations, anomaly detection on training data and backpopulated predictions, and integration of upstream/downstream service code into automated LLM-assisted reviews.
- Improve the Bruno API test suites that run against the dev environment after every merge, balancing coverage with cost.
- Extend the staging validation system that replays production traffic: detect divergences in computed features, approval statistics, and schema conformance between staging and production models.
- Design and maintain production monitoring: dashboards, alerts, and cross-service distributed tracing of the full onboarding flow.
- Define and track ML-specific health metrics (approval rates, score distributions, feature drift) alongside standard service metrics (latency, error rates, resource usage).
- Build tooling that transforms the internal decision trace into human-readable explanations for operations and compliance stakeholders.
- Coordinate with upstream data providers to define fallback strategies when external data is unavailable (secondary providers, default values, deferred decisions).
- Extend the input-validation framework so that non-critical schema violations fall back to safe defaults (with alerts) while critical violations block the decision, and simulate the impact of those fallbacks on decision quality.
- Design and implement new endpoints as the product evolves (e.g., counter-offers, intermediary onboarding steps, modified loan conditions).
- Integrate new data sources into the online decision path, including features from video-call analysis and a low-latency feature store for returning customers, in coordination with the pipeline engineer.
- Profile and optimize inference time: replace heavy dependencies (e.g., LightGBM with ONNX), evaluate faster data-processing libraries (e.g., Polars over pandas), and offload hot paths to compiled code where justified.
- Keep base Docker images lean and startup times low.
- Review pull requests in adjacent repositories (primarily C# / .NET and TypeScript / React) that affect the services immediately upstream or downstream of the decision system, to catch integration issues early.
BENEFITS
- Attractive compensation package, including stock options.
- Fast-paced environment with significant growth opportunities.
- 15 annual vacation days + 7 annual personal days.
- Option to work remotely 3-4 days per week, or fully remote (as long as you can come to CDMX about twice a year).
- Flexible work schedule.
REQUIREMENTS
- Production ML experience: You have deployed ML models to production and dealt with the failure modes specific to learned systems: silent degradation, training/serving skew, selection bias, data-pipeline breakages, and schema drift.
- Software engineering: Strong Python skills (you will work daily with FastAPI, Pydantic, and pytest). Comfortable reading and reviewing C# and TypeScript code.
- Containerization & orchestration: Hands-on experience with Docker and Kubernetes in a production setting (resource management, rolling deployments, health probes).
- Testing philosophy: You think in terms of layered validation (unit, integration, contract, shadow-traffic comparison) and know how to balance coverage against cost and speed.
- Monitoring & observability: Experience designing dashboards, alerts, and distributed traces for services where "the service returned 200 but the answer was wrong" is a real failure mode.
- API design: Ability to design clear, evolvable REST APIs and negotiate schema changes across teams.
- Communication: You will be the main point of contact between Data Science and the platform engineering teams. Clear, precise written and verbal communication is essential.
- Fluency in both Spanish and English. Most of our meetings are in Spanish, but the code and most documentation are written in English.
- Experience with model-serving runtimes (ONNX Runtime, TensorFlow Serving, Triton) or model compilation/optimization techniques.
- Familiarity with Dagster, DVC, or similar ML pipeline / data-orchestration tools.
- Familiarity with the Prometheus / Grafana observability stack.
- Experience with performance profiling and optimization in Python (Polars, NumPy, Numba, Cython, or Rust extensions).
- Exposure to financial services, credit decisioning, or regulated environments where auditability and explainability matter.
- Experience building or maintaining CI/CD pipelines with automated ML-specific validations (data quality checks, model performance gates, decision-impact analysis).
- Knowledge of the Azure ecosystem (AKS, ACR, Azure DevOps).
- Familiarity with API-testing tools such as Bruno or Postman for contract and integration testing.
- Familiarity with Pants or similar build systems.