Senior Site Reliability Engineer - AI Infrastructure
Andromeda Cluster
Location: Global Remote / San Francisco · Full-Time
About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.
We began with a single managed cluster, but it filled almost instantly. Since then, we've been quietly building the systems, network, and orchestration layer that makes the world's AI infrastructure more accessible.
Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it's needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.
Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, much as capital flows through the world's financial markets.
We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.
The Role
This is not a generalist SRE role.
You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems.
We're looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from the network fabric through the kernel up to the framework.
What You'll Own
GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.
Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.
Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.
Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management (a minimal health-check sketch follows this list).
Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.
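To make the GPU health-check part of this concrete, here is a minimal sketch, assuming `nvidia-smi` is available on each node, of a per-node probe that flags uncorrectable ECC errors and thermal excursions. The query fields and thresholds are illustrative choices, not Andromeda's actual tooling.

```python
#!/usr/bin/env python3
"""Minimal GPU health probe sketch (illustrative, not production tooling).

Assumes nvidia-smi is on PATH; query field names can vary by driver version.
"""
import subprocess

# Fields queried per GPU: index, uncorrectable ECC errors, temperature (C).
QUERY = "index,ecc.errors.uncorrected.volatile.total,temperature.gpu"
TEMP_LIMIT_C = 85   # illustrative thermal threshold
ECC_LIMIT = 0       # any uncorrectable ECC error marks the GPU unhealthy


def probe() -> list[str]:
    """Return a list of human-readable problems found on this node."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = []
    for line in out.strip().splitlines():
        idx, ecc, temp = [field.strip() for field in line.split(",")]
        # "[N/A]" appears when ECC is disabled or unsupported on the SKU.
        if ecc.isdigit() and int(ecc) > ECC_LIMIT:
            problems.append(f"GPU {idx}: {ecc} uncorrectable ECC errors")
        if temp.isdigit() and int(temp) > TEMP_LIMIT_C:
            problems.append(f"GPU {idx}: {temp}C exceeds {TEMP_LIMIT_C}C")
    return problems


if __name__ == "__main__":
    issues = probe()
    print("\n".join(issues) if issues else "node healthy")
```

In a real fleet these signals would feed the scheduler so unhealthy nodes are cordoned and drained automatically; the sketch only covers the per-node check.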
What We're Looking For
GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale (a rough timing sketch follows this list).
Distributed Training & ML Frameworks:
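On the "why is this all-reduce slow" question, one useful first step is simply measuring achieved bus bandwidth and comparing it against what the fabric should deliver. Below is a minimal sketch using PyTorch's NCCL backend; the payload size, iteration counts, and launch command are illustrative assumptions.

```python
"""Minimal all-reduce bandwidth probe (illustrative sketch, not a benchmark suite).

Launch with e.g.: torchrun --nproc_per_node=8 allreduce_probe.py
"""
import os
import time
import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group("nccl")                  # NCCL backend for GPU collectives
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    n_bytes = 1 << 30                                # 1 GiB payload per rank
    x = torch.ones(n_bytes // 4, dtype=torch.float32, device="cuda")

    for _ in range(5):                               # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters

    # Ring all-reduce moves roughly 2*(world-1)/world of the payload per rank.
    bus_bw = (n_bytes * 2 * (world - 1) / world) / elapsed / 1e9
    if rank == 0:
        print(f"avg all-reduce: {elapsed*1e3:.1f} ms, ~{bus_bw:.1f} GB/s bus bandwidth")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If the measured number sits far below the fabric's expected line rate, the problem usually lives below the framework: a degraded link, a misconfigured NIC, or NCCL falling back to a slower transport.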
Interview Prep Guide
Preparation Strategy
To prepare for this role, review GPU cluster architecture, distributed systems, and performance-optimization techniques. Practice designing complex systems and discussing their trade-offs. Revisit your own experience with GPU infrastructure, customer technical partnerships, and performance tuning, and be ready to walk through specific examples and scenarios. Resources such as NVIDIA's GPU documentation, distributed-systems literature, and system design blogs can help fill gaps.
Likely Interview Rounds
- 1. Technical (~60 min)
What to prep: Review GPU cluster architecture, distributed training, and performance optimization techniques. Be prepared to discuss your experience with GPU infrastructure and large-scale training workloads.
- How do you design a GPU cluster for large-scale distributed training?
- What are the common failure modes of distributed training and how do you mitigate them?
- How do you optimize the performance of a GPU cluster for training throughput and cost efficiency?
- 2. System Design (~90 min)
What to prep: Review system design principles, GPU cluster architecture, and distributed systems. Practice designing and discussing complex systems and trade-offs.
- Design a system for multi-provider, multi-region GPU compute clusters
- How would you implement topology-aware scheduling, networking, and storage decisions for a large-scale GPU cluster? (a toy placement sketch follows the rounds list)
- How do you ensure the reliability and performance of a GPU cluster in a production environment?
- 3. Behavioral (~60 min)
What to prep: Review your past experiences with GPU infrastructure, customer technical partnerships, and performance optimization. Prepare to discuss specific examples and scenarios.
- Tell me about a time when you had to troubleshoot a complex issue with a GPU cluster
- How do you handle customer technical partnerships and onboard new customers?
- Can you describe a situation where you had to optimize the performance of a GPU cluster for a specific workload?
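For the topology-aware scheduling question above, it helps to have a concrete toy model in mind. The sketch below is purely illustrative (the node and leaf-switch names are invented): it packs a job under as few leaf switches as possible, which is the basic intuition behind keeping collective traffic off the spine.

```python
"""Toy topology-aware placement sketch: prefer nodes under one leaf switch.

Purely illustrative; the topology map and node names are invented for the example.
"""
from collections import defaultdict

# Hypothetical inventory: node -> leaf switch it hangs off.
NODE_TO_LEAF = {
    "node-01": "leaf-a", "node-02": "leaf-a", "node-03": "leaf-a",
    "node-04": "leaf-b", "node-05": "leaf-b",
    "node-06": "leaf-c",
}


def place(job_nodes: int, free_nodes: set[str]) -> list[str]:
    """Pick job_nodes free nodes, spanning as few leaf switches as possible."""
    by_leaf = defaultdict(list)
    for node in free_nodes:
        by_leaf[NODE_TO_LEAF[node]].append(node)

    # Greedily take the fullest leaves first so the job spans few switches.
    chosen: list[str] = []
    for leaf in sorted(by_leaf, key=lambda l: len(by_leaf[l]), reverse=True):
        for node in sorted(by_leaf[leaf]):
            if len(chosen) == job_nodes:
                return chosen
            chosen.append(node)
    raise RuntimeError("not enough free nodes for the job")


if __name__ == "__main__":
    print(place(3, {"node-01", "node-02", "node-04", "node-05", "node-06"}))
    # -> two nodes under one leaf plus one more, rather than three separate leaves
```

Real schedulers also weigh fragmentation, reservations, and preemption; the point of the toy is only that placement decisions should be made with the fabric topology in view.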
Most Likely Questions
- What are the challenges of running large-scale distributed training workloads on GPU infrastructure?
- How do you ensure the reliability and performance of a GPU cluster in a production environment? (a worked error-budget sketch follows this list)
- Can you describe your experience with GPU cluster architecture and distributed systems?
- How do you optimize the performance of a GPU cluster for training throughput and cost efficiency?
- What are some common failure modes of distributed training and how do you mitigate them?
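For the reliability questions, it is worth being able to turn an SLO into concrete numbers on the spot. A minimal worked example, using an assumed 99.5% monthly availability target (not a stated Andromeda SLO):

```python
"""Worked error-budget arithmetic for an illustrative 99.5% monthly SLO."""
SLO = 0.995                      # assumed availability target, for illustration only
MINUTES_PER_MONTH = 30 * 24 * 60

error_budget_min = (1 - SLO) * MINUTES_PER_MONTH
print(f"Monthly error budget: {error_budget_min:.0f} minutes")   # -> 216 minutes
```

The harder part for GPU fleets is deciding what counts against that budget: a node with recurring ECC errors or a degraded NVLink can look "up" while quietly burning training throughput, so the SLO definition has to capture job-level health, not just node uptime.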
Common Pitfalls
- Lack of experience with GPU infrastructure and large-scale distributed training
- Inadequate understanding of system design principles and distributed systems
- Insufficient knowledge of performance optimization techniques for GPU clusters
- Poor communication skills and inability to handle customer technical partnerships
Free Prep Resources
- NVIDIA's GPU documentation
- Distributed systems literature (e.g. 'Designing Data-Intensive Applications' by Martin Kleppmann)
- System Design Primer (GitHub: donnemartin)
- Kubernetes documentation