
Senior SRE Practitioner
- Malaysia
- Permanent
- Full-time
- Design, build, and maintain scalable and reliable systems.
- Monitor system performance and proactively address bottlenecks or issues.
- Implement strategies to increase system uptime and reduce the frequency and duration of outages.
- Automation and Tooling:
  - Develop and maintain automation tools for deployment, monitoring, and incident response.
  - Create scripts and workflows to reduce manual intervention and improve efficiency.
- Incident Management:
  - Respond to system outages and incidents, performing root cause analysis and implementing fixes.
  - Develop and maintain runbooks and documentation for incident response.
- Monitoring and Observability:
  - Set up and maintain monitoring tools to track system health and performance.
  - Define and measure Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Collaboration and Communication:
  - Work closely with development teams to ensure systems are designed with reliability in mind.
  - Collaborate with operations teams to improve deployment processes and system management.
- Capacity Planning and Scaling:
  - Analyze system usage and plan for future capacity needs.
  - Implement solutions to handle traffic spikes and ensure scalability.
- Continuous Improvement:
  - Identify areas for improvement in system architecture and processes.
  - Advocate for best practices in reliability engineering and DevOps.
- Strong knowledge of Linux/Unix systems and networking.
- Proficiency in programming, scripting, and automation languages and tools such as Python, Ansible, PowerShell, .NET, and Java.
- Experience with cloud platforms (e.g., Azure, AWS).
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Expertise in monitoring and observability tools (e.g., AppDynamics, Application Insights, Dynatrace, Grafana, ELK Stack).
- Understanding of CI/CD pipelines and automation frameworks.
- Problem-solving skills and ability to perform root cause analysis.
- Excellent communication and collaboration skills.
- Experience with distributed systems and microservices architecture.
- Knowledge of database systems (SQL and NoSQL).
- Familiarity with incident management frameworks (e.g., ITIL, SRE best practices).
- Certifications in cloud technologies or DevOps tools.
- Analytical mindset with a focus on reliability and scalability.
- Passion for automation and reducing manual work.
- Ability to work under pressure and handle critical incidents effectively.
- Commitment to continuous learning and staying updated on industry trends.