Data Center System Operations Engineer

  • Johor
  • Permanent
  • Full-time
  • 1 month ago
JOB DESCRIPTION

We are seeking a Data Center System Operations Engineer to support the daily operation of a state-of-the-art GPU cluster. The role ensures the reliability, scalability, and efficiency of our data center operations, supporting high-performance GPU infrastructure for cutting-edge AI workloads.

Key Responsibilities:

  • Oversee daily operations of GPU clusters and data center systems.
  • Monitor system health, performance, and capacity using industry-standard tools and frameworks.
  • Respond to and resolve operational incidents, ensuring minimal downtime and maximum availability.
  • Manage the deployment, configuration, and optimization of GPU servers, network devices, and supporting infrastructure (e.g., CPU servers and storage).
  • Perform hardware diagnostics and preventive maintenance for GPU servers, storage, and networking equipment.
  • Troubleshoot system issues related to hardware, operating systems, and applications.
  • Work closely with cross-functional teams, including network engineers, system administrators, and developers, to support AI workloads.
  • Maintain accurate documentation for system configurations, processes, and incident reports.
  • Implement and enforce security best practices in system operations.
  • Identify and propose improvements to enhance system performance, reduce costs, and optimize resource utilization.

Desired Skills:

  • Familiarity with GPU hardware (e.g., NVIDIA GPUs) and AI/ML workloads is a strong advantage.
  • Experience with storage systems (e.g., NVMe, SAN, NAS) and networking protocols (e.g., TCP/IP, RDMA) will be advantageous.
  • Knowledge of ticketing systems and troubleshooting processes for CPU/GPU clusters.
  • Familiarity with networking concepts, including TCP/IP, VLANs, and load balancing.
  • Experience managing bare-metal servers, GPU infrastructure, or high-performance computing systems will be an added advantage.

BASIC QUALIFICATIONS

  • Bachelor's degree in Computer Science, Information Technology, Electrical Engineering, or a related field. Equivalent experience will be considered.
  • 2+ years of experience in system operations within IT infrastructure or cloud services.
  • Hands-on experience in IT hardware replacement.
  • Experience in data center operations, system administration, or a similar role.
  • Knowledge of server hardware, including GPU cards, CPU configurations, and storage solutions.
  • Understanding of Linux fundamentals and Kubernetes environments.
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana) and logging frameworks.

foundit
