Principal Systems Analyst (SRE)
Anchor Search Group
- Malaysia
- Permanent
- Full-time
- Provide effective maintenance and support services to business units. Ensure the availability and robustness of IT applications for a 24×7 mission-critical system.
- Develop and implement automation tools to streamline operational tasks.
- Ensure the efficient functioning of applications, monitor performance, automate processes, and enhance system availability.
- Be highly responsive to the dynamic nature of the business environment and react quickly to any critical system issue.
- Provide application support during normal office hours and while on on-call standby.
- Ensure operational health of the systems by resolving escalated problems and performing application fine-tuning and system performance improvement activities.
- Closely monitor incident resolution and ensure SLAs are adhered to.
- Troubleshoot and investigate issues, and raise defects when necessary.
- Participate in planning and revenue assurance activities such as systems reconciliation and disaster recovery.
- Adopt good industry practices and adhere to IS Policies and Standards.
- Review designs and solutions for system and application changes to ensure quality delivery and stability of business operations.
- Manage application vendors to provide timely and reliable system support.
- Lead and participate in projects aimed at improving system reliability.
- Collaborate with cross-functional teams to identify areas for enhancement, implement changes, and measure the impact.
- Monitor infrastructure components such as servers, databases, and networking.
- Build monitoring systems that focus on symptoms rather than just outages. Alerts should provide actionable insights for rapid response.
- Bachelor's degree in Computer Science, Computer Engineering, Information Technology or related fields. At least 7 to 9 years of relevant working experience, preferably in the telco industry.
- Certifications related to Site Reliability Engineering are a plus.
- Experienced in program planning and initiatives, with a demonstrated ability to drive SRE initiatives across departments in a large organization, develop strategic plans, set goals, and collaborate with stakeholders to align SRE efforts with overall business goals.
- Experienced as a Site Reliability Engineer or in a similar role, specifically handling reliability improvement projects in large-scale, complex, business-critical application environments, and familiar with the ITSM/ITIL framework.
- Proficient in containerization technologies and container orchestration platforms (e.g. Docker and Kubernetes), with an understanding of container networking, storage, and security concepts.
- Proficient in cloud platforms (e.g. AWS, Google Cloud Platform (GCP), or Microsoft Azure) and cloud services (e.g. compute instances, storage, databases, networking, and monitoring tools).
- Proficient in CI/CD pipelines and tools like Jenkins, GitLab CI/CD, or CircleCI for automating software builds, testing, and deployment processes.
- Proficient in languages such as Python, Java, Go, or Ruby, with scripting skills for automation tasks and tool development. Knowledge of tools such as Splunk and Kibana is a plus.
- Ability to communicate asynchronously and work effectively with cross-functional teams.
- Ability to quickly master in-depth application and business domain knowledge.
- Ability to coach junior team members.
- Willing to work extended hours when needed.