Lead Site Reliability Engineer

Kuala Lumpur
Permanent
Full-time

1 month ago

POSITION OVERVIEW: Software Development Specialist

POSITION GENERAL DUTIES AND TASKS:

A Site Reliability Engineer (SRE) for VMware Cloud Foundation (VCF) focuses on ensuring the reliability, availability, and performance of the VCF platform through automation, monitoring, and proactive problem-solving. This role involves developing and implementing strategies to improve the platform's stability, collaborating with development and operations teams, and contributing to the overall VCF roadmap.

Key Responsibilities:

Platform Reliability and Availability: Design and implement strategies to ensure high availability and performance of VCF, including monitoring, alerting, and incident response.
Automation and Tooling: Develop and maintain automation scripts and tools to streamline operations, improve efficiency, and reduce manual intervention.
Performance Monitoring and Optimization: Monitor system performance, identify bottlenecks, and implement solutions to optimize resource utilization and overall performance.
Incident Management and Resolution: Participate in incident response, troubleshoot complex issues, and contribute to post-incident reviews to prevent recurrence.
Collaboration and Communication: Work closely with development, operations, and other teams to ensure seamless integration and efficient operations.
Documentation and Knowledge Sharing: Create and maintain comprehensive documentation for system configurations, troubleshooting procedures, and operational best practices.
Continuous Improvement: Identify areas for improvement in the VCF platform and work with relevant teams to implement changes and enhancements.

Required Skills and Experience:

Strong understanding of VMware Cloud Foundation components, including vSphere, vSAN, NSX, and vRealize Suite.
5-6 years of relevant experience in VCF is required.
Proficiency in scripting languages such as Python, Go, or PowerShell.
Experience with automation tools and frameworks, such as Ansible, Terraform, or SaltStack.
Solid understanding of cloud computing concepts and principles.
Experience with monitoring and alerting tools, such as Prometheus, Grafana, or vRealize Operations.
Strong problem-solving and troubleshooting skills.
Excellent communication and collaboration skills.

foundit
