As an Infrastructure Platform Engineer, you will build and maintain the infrastructure that powers both our AI application runtime and model training workflows. You will own secure, observable, and scalable environments that support model hosting, prompt execution, agent tools, and internal model training pipelines. Your work ensures that product and platform engineers can deploy and scale AI workloads efficiently across cloud and on-prem infrastructure. This role blends DevOps, ML systems engineering, and platform development for AI workloads.

Responsibilities

1. Application Infrastructure
- Manage model routing, fallback, and token usage enforcement across LLM providers.
- Operate and optimize model-serving infrastructure (e.g., vLLM, Triton, OpenAI proxies).
- Build and maintain tool execution runtimes and internal service orchestration layers.
- Implement secure API gateways, rate limiting, authentication, and quota management.

2. Training Infrastructure
- Develop training pipelines for pre-training and fine-tuning workflows.
- Manage GPU scheduling, storage access, and experiment tracking (e.g., MLflow, Weights & Biases).
- Partner with AI researchers and platform engineers to operationalize training and evaluation runs.
- Maintain dataset versioning, access control, and data preprocessing pipelines.

3. Platform Operations
- Maintain CI/CD systems for platform services and runtime components.
- Establish observability and monitoring systems across model, memory, and agent services.
- Apply best practices for infrastructure security, availability, and cost optimization.
- Document infrastructure components and standard deployment practices.

Qualifications
- 6+ years of experience in infrastructure engineering, DevOps, or ML systems.
- Strong command of Kubernetes, Terraform, and cloud-native architecture (AWS, Azure, GCP).
- Experience with containerization, CI/CD, and API security practices.
- Prior exposure to model hosting or ML pipeline orchestration.
- Understanding of networking concepts, including VPNs, VNets, and hybrid connectivity.
- Familiarity with security best practices for cross-platform infrastructure.
- Experience with on-prem infrastructure, including networking and storage hardware.