Senior Site Reliability Engineer (Cloud Native & Observability)
Kuala Lumpur
Permanent
Full-time
2 months ago
Role Overview:
Responsible for highly resilient, scalable, and cost-optimized systems in multi-cloud environments. Focus on infrastructure as code, observability, chaos engineering, and Kubernetes ecosystem stability across distributed systems used by millions of users.

- Design and implement scalable and reliable systems.
- Monitor system health using tools like Prometheus, Grafana, or Datadog.
- Manage CI/CD pipelines and infrastructure automation (e.g., Jenkins, GitHub Actions).
- Troubleshoot incidents and ensure root cause analysis is completed.
- Work with DevOps and development teams to improve system performance.
- Build tools to automate operations and reduce manual intervention (IaC).

Key Responsibilities:
- Architect and maintain multi-region Kubernetes clusters (AKS/EKS/GKE) with an Istio/Linkerd service mesh.
- Implement full-stack observability using OpenTelemetry, Grafana Loki, and Jaeger.
- Build self-healing infrastructure with tools like KEDA, Argo CD, and Crossplane.
- Design and manage CI/CD pipelines with a GitOps approach (FluxCD/Argo CD).
- Conduct chaos testing using Gremlin or LitmusChaos to validate system resilience.
- Work with finance and ops teams on FinOps strategies for optimizing cloud usage (Spot instances, autoscaling policies).
- Implement policy-as-code for security compliance via OPA/Gatekeeper.

Technology Stack:
- Languages: Go, Python, Bash
- Cloud: AWS, Azure, GCP
- IaC Tools: Terraform, Helm, Pulumi
- Observability: Prometheus, Grafana, ELK, New Relic

Certifications Preferred:
CKA, CKAD, Terraform Associate, Google SRE, FinOps Certified Practitioner

Requirements:
- 8+ years of experience in DevOps, infrastructure, or cloud engineering roles
- Minimum 5 years in a dedicated Site Reliability Engineering (SRE) role
- Bachelor's degree in Computer Science, Engineering, or a related field
- Familiarity with Terraform, Ansible, or other infrastructure automation tools
- Experience with public cloud platforms (AWS, Azure, or GCP)
- Strong scripting skills in Python, Bash, or Go