About RouteGenie
RouteGenie is a US-based SaaS platform serving the Non-Emergency Medical Transport (NEMT) industry across North America. Our software powers dispatch, routing, billing, and operations for hundreds of transportation providers moving millions of medically vulnerable passengers every year.
AI is a strategic priority for us. We believe agentic systems will materially change how dispatchers, schedulers, and operators do their work, and this role exists because we want a strong engineer to help us build that future inside our product.
The Role
You will design and ship agentic workflows and LLM-powered features inside the RouteGenie product — dispatcher copilots, intelligent routing assistants, document and intake automation, and customer-facing voice and chat agents. You'll also build internal-operations agents that accelerate our own marketing, sales, support, and back-office workflows.
As RouteGenie's first dedicated AI engineer, you'll propose the patterns, recommend the tools, and shape how we measure quality. The Engineering Project Manager and Head of Technology will review major architectural decisions and partner with you on them. A strong candidate will have significant influence on the direction we take, earned through the strength of their thinking.
You'll sit on the LatAm Hub engineering team in Mexico and work daily with our Europe-based engineering team and US-based product and operations stakeholders.
What This Role Is Not
To be explicit: this is not a foundation-model research role, not a pure ML or data-science role, and not a general backend role that occasionally touches AI. AI features will be your full-time focus.
Key Responsibilities
Architect and ship agentic workflows and LLM-powered product features: Handle planning, memory, state management, tool orchestration, guardrails, human-in-the-loop checkpoints, and voice/chat interfaces.
Partner with leadership: Collaborate with product and engineering leadership during feature ideation and scoping, providing pragmatic input on level of effort, technical feasibility, and the realistic likelihood that a proposed AI approach will work in production rather than remain demo-grade.
Design retrieval architectures: Select and implement the right retrieval architecture for each use case, choosing among vector retrieval, long-context/cache-augmented generation, tool-based agentic retrieval, graph-based retrieval, and hybrid approaches. Avoid one-size-fits-all retrieval.
Establish evaluation systems: Build the systems that prove AI features actually work. Define accuracy, correctness, and safety metrics; assemble test datasets; run pre-release benchmarks; and monitor production quality to catch regressions when models, prompts, or data change. We do not ship LLM features we cannot measure.
Production hardening: Integrate AI features into the existing RouteGenie stack and harden them for production — managing APIs, latency and cost budgets, prompt versioning, observability, and PII/PHI safety while working alongside platform and product engineers.
Internal operations pipelines: Build agentic pipelines for RouteGenie's internal operations — automating marketing, sales, support, and back-office workflows where AI agents can accelerate manual effort.
Team enablement: Educate and enable the broader engineering team to incorporate agentic flows into the features they build so AI capabilities are woven into normal product work across the team, not siloed to one engineer.
AI-assisted development: Champion AI-assisted development practices across the LatAm Hub engineering team.
What Success Looks Like
In your first 90 days:
You've formed a working point of view on the right model and execution environment for our top AI feature candidates.
You've stood up the evaluation and observability scaffolding the rest of the team will use.
You've shipped one small production AI feature with a documented quality methodology.
By the end of year one:
You've shipped 3–5 AI features in production, each with measured quality and a regression strategy that survives model and prompt updates.
At least one internal-operations agent is live and measurably reducing manual effort.
You're the engineer others come to when they have an "is this even an AI problem?" question.
Technology Stack
You'll work with the following tools and bring informed opinions on choosing among them:
LLM Providers & APIs: Anthropic Claude (primary), OpenAI, AWS Bedrock
Local / Self-Hosted LLMs: Ollama, LM Studio, llama.cpp, vLLM; open-weight model families (Llama, Qwen, Mistral, etc.)
Agent Frameworks: LangChain / LangGraph, LlamaIndex, OpenAI Agents SDK, or equivalent
Retrieval & Knowledge: Vector databases (Pinecone, Weaviate, pgvector); RAG, cache-augmented generation, tool-based agentic retrieval, GraphRAG, hybrid approaches
Voice AI: ElevenLabs, VAPI, LiveKit, Deepgram
LLM Observability & Eval: LangSmith, Braintrust, Phoenix, Helicone, or similar
AI-Assisted Development: Claude Code
RouteGenie Stack: Python, Django, PostgreSQL, Angular, TypeScript
Qualifications & Requirements
Required Qualifications:
Experience: 3+ years of software engineering experience.
Production AI: 1+ year of hands-on experience with production LLM / AI features shipped to real users (not prototypes or coursework).
Languages: Strong Python skills; comfort with TypeScript.
Frameworks: Hands-on experience with at least one agent framework and multiple retrieval/context-augmentation approaches, alongside the judgment to choose between them.
APIs: Production experience with one or more of the major LLM provider APIs listed in our Technology Stack.
Architectural Judgment: Sound judgment on AI architecture choices. Ability to select the right model and execution environment (third-party API, foundation-model provider, local/self-hosted open-weight model, specialized voice or embedding service) against cost, latency, accuracy, and data-residency constraints. Knows when traditional ML, or no AI at all, is the right call, and can implement classical ML when it fits.
Quality Measurement: Demonstrated experience measuring AI feature quality in production. Ability to describe specific metrics defined, test datasets built, and how regressions were detected and addressed when models, prompts, or data changed.
Communication: Working professional English; strong async written communication for collaboration across Mexico, Europe, and US time zones.
Strongly Preferred:
Voice AI: Hands-on experience with voice AI. NEMT dispatch and customer-service flows are voice-heavy, and voice agents will be a major product surface.
Regulated Data: Experience in a healthcare or regulated-data context (HIPAA, PII/PHI handling) and the disciplines that come with it (audit logging, data minimization, access controls).
Self-Hosting: Local / self-hosted LLM experience running open-weight models on-prem or in a VPC. Critical for PHI-sensitive use cases where data cannot leave our infrastructure.
Anthropic Ecosystem: Claude API / Anthropic SDK experience, including Claude-specific features and tooling (extended thinking, prompt caching, tool use, computer use, Claude Agent SDK).
Preferred (Nice-to-Have):
LLM observability / eval tooling experience (LangSmith, Braintrust, Phoenix, Helicone, or similar).
Cost and latency optimization at LLM scale (prompt caching, model routing, token budgeting).
Traditional ML / data science background (model training, feature engineering, evaluation methodology).
Django / PostgreSQL background.
Multi-tenant SaaS experience.
Open-source AI contributions or public agent projects.