Systems Engineer, AI Infrastructure
Singapore
Beijing
Palo Alto
Senior / Technical Staff
Remote
Role Overview
We are seeking a Systems Engineer to architect and manage the highly flexible, scalable infrastructure that powers our model training and inference.
You will be responsible for the reliability, efficiency, and adaptability of large-scale GPU clusters running long-horizon jobs, while ensuring the infrastructure can evolve rapidly alongside changes in model architecture and learning paradigms. Unlike conventional LLMs, our models challenge several implicit assumptions built into standard training and inference stacks: their training semantics, state management, and iteration patterns differ meaningfully from traditional recipes.
Key Responsibilities
Distributed Training Optimization: Design and implement robust parallelism strategies tailored for novel architectures across GPU clusters.
Reliability, Fault Tolerance & Checkpointing: Build automated systems for health monitoring, silent failure detection, and ultra-fast asynchronous checkpointing to ensure high availability for long-running jobs.
Required Qualifications
3+ years of experience in HPC, cloud infrastructure, or distributed ML systems.
Deep expertise in PyTorch Distributed (FSDP2) and collective communication primitives.
Strong system-level programming skills (C++, Python) and experience with cluster orchestrators (Slurm, Kubernetes).
Familiarity with GPU profiling tools (Nsight Systems, PyTorch Profiler).
Preferred Qualifications
Experience training LLMs at the 10B-parameter scale.
Experience working on newer model architectures (e.g., encoder-decoder, chunked attention), including extensive architecture experiments.
Contributions to open-source distributed training libraries (e.g., PyTorch, Megatron-LM).
Familiarity with fp8 training, mixed precision, or advanced quantization techniques.
