Description and Requirements

We are seeking a skilled Golang Engineer to design and implement comprehensive monitoring and observability solutions for our cloud infrastructure. This role is responsible for building scalable monitoring systems that provide real-time visibility into system health and performance across Linux, OpenStack, and Kubernetes environments.

Key Responsibilities

- Design and develop core components of Kubernetes-based container platforms using Golang, focusing on control plane extensions, operators, and cloud-native service meshes.
- Implement and optimize Kubernetes networking (CNI plugins such as Calico and Cilium) and storage solutions (CSI drivers, Rook/Ceph integration), addressing challenges in multi-tenant isolation and high-throughput data paths.
- Troubleshoot deep-level Kubernetes issues (e.g., etcd corruption, kube-scheduler deadlocks, CNI policy conflicts) using Golang debugging tools (pprof, delve) and log analysis.
- Build automation frameworks for cluster lifecycle management, security hardening, and observability using Golang (primary) and Python (secondary, for scripting).
- Collaborate with infrastructure teams to align platform capabilities with AI workload requirements, optimizing resource scheduling for GPU/accelerator workloads.

Qualifications

Technical Expertise:

- Mastery of Golang: 3+ years building production-grade systems with goroutines, interfaces, and the standard library and core ecosystem packages (e.g., net/http, k8s.io/client-go).
- Kubernetes Internals: Deep understanding of control plane components (API server, scheduler, controller manager) and the ability to extend them via CRDs/Operators.
- Network/Storage Proficiency: Hands-on experience selecting and implementing CNI (VXLAN/BGP modes) and CSI solutions (RBD, iSCSI), with performance benchmarking skills.
- Linux/Container Expertise: Proficiency in cgroups, namespaces, and container runtimes (containerd, CRI-O) for debugging resource leaks or security flaws.

Experience:

- 3+ years developing cloud infrastructure with Golang as the primary language, including at least one major Kubernetes platform project (e.g., cluster autoscaler, custom scheduler).
- Demonstrated ability to resolve critical production issues (e.g., etcd leader election failures, network policy drops) in large-scale clusters (1k nodes).

Soft Skills:

- Rigorous analytical approach to system design and failure root-cause analysis.
- Ability to document complex technical concepts for cross-team alignment.

Preferred Add-ons

- Kubernetes SIG contributions (e.g., networking, storage, or scheduling working groups).
- Experience with eBPF-based tools (Cilium, Pixie) for advanced network observability.
- Proficiency in Python for infrastructure scripting (Ansible/Terraform integrations) or Java for enterprise service interoperability.
- Familiarity with service meshes (Linkerd, Istio) and GitOps pipelines (Argo CD, Flux).
- Knowledge of cloud-native security (OPA/Gatekeeper, Kyverno) and AI/ML workload optimization.