
Site Reliability Engineer (SRE) - PD
- Malaysia
- Permanent
- Full-time
- Partner with development teams to integrate reliability into the software development lifecycle.
- Design and implement highly available and fault-tolerant architectures for mission-critical applications.
- Design, implement, manage, and optimize Kubernetes clusters for availability, scalability, and security.
- Perform upgrades, patches, and security hardening for Kubernetes infrastructure.
- Automate application deployment, scaling, and infrastructure provisioning.
- Implement CI/CD pipelines for deploying and updating Kubernetes applications.
- Develop and maintain IaC scripts (e.g., Terraform, Ansible) for provisioning and managing cloud and container resources.
- Utilize AWS, GCP, or Azure services for Kubernetes deployments and integrations.
- Apply cloud-native best practices for scalability and performance.
- Implement monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK, etc.).
- Proactively identify and resolve performance bottlenecks and reliability issues.
- Respond to and resolve production incidents with minimal downtime.
- Conduct post-incident analysis and implement preventive measures.
- Perform capacity planning to ensure the cloud-hosted Kubernetes infrastructure can accommodate current and future workloads.
- Collaborate with the security team to implement and enforce Kubernetes and cloud security best practices.
- Perform regular vulnerability assessments and compliance checks.
- Work cross-functionally with DevOps, security, and development teams.
- Maintain comprehensive documentation for processes and configurations.
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- Minimum of 3 years of proven experience as a Site Reliability Engineer or in a similar role.
- Strong programming or scripting skills, with proficiency in languages such as Bash, Python, Go, or Java.
- Extensive experience with Kubernetes orchestration, including cluster setup, management, and troubleshooting.
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible) and cloud platforms.
- Solid understanding of virtualization and networking concepts.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.
- Knowledge of cloud security best practices.
- Familiarity with microservices frameworks.
- Advantage: Certified Kubernetes Administrator (CKA) or equivalent certification.