Senior Site Reliability Engineer - AI Infrastructure
Andromeda Cluster
Location: Global Remote / San Francisco · Full-Time
About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.
We began with a single managed cluster, but it filled almost instantly. Since then, we've been quietly building the systems, network, and orchestration layer that makes the world's AI infrastructure more accessible.
Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it's needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.
Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, much as capital flows through the world's financial markets.
We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.
The Role
This is not a generalist SRE role.
You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems.
We're looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from the network fabric through the kernel up to the framework.
What You'll Own
GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.
Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.
Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.
Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management (a minimal health-check sketch follows this list).
Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.
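To make the GPU health-check part of this concrete, here is a minimal sketch, assuming `nvidia-smi` is available on each node, of a per-node probe that flags uncorrectable ECC errors and thermal excursions. The query fields and thresholds are illustrative choices, not Andromeda's actual tooling.

```python
#!/usr/bin/env python3
"""Minimal GPU health probe sketch (illustrative, not production tooling).

Assumes nvidia-smi is on PATH; query field names can vary by driver version.
"""
import subprocess

# Fields queried per GPU: index, uncorrectable ECC errors, temperature (C).
QUERY = "index,ecc.errors.uncorrected.volatile.total,temperature.gpu"
TEMP_LIMIT_C = 85   # illustrative thermal threshold
ECC_LIMIT = 0       # any uncorrectable ECC error marks the GPU unhealthy


def probe() -> list[str]:
    """Return a list of human-readable problems found on this node."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = []
    for line in out.strip().splitlines():
        idx, ecc, temp = [field.strip() for field in line.split(",")]
        # "[N/A]" appears when ECC is disabled or unsupported on the SKU.
        if ecc.isdigit() and int(ecc) > ECC_LIMIT:
            problems.append(f"GPU {idx}: {ecc} uncorrectable ECC errors")
        if temp.isdigit() and int(temp) > TEMP_LIMIT_C:
            problems.append(f"GPU {idx}: {temp}C exceeds {TEMP_LIMIT_C}C")
    return problems


if __name__ == "__main__":
    issues = probe()
    print("\n".join(issues) if issues else "node healthy")
```

In a real fleet these signals would feed the scheduler so unhealthy nodes are cordoned and drained automatically; the sketch only covers the per-node check.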
What We're Looking For
GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale (a rough timing sketch follows this list).
Distributed Training & ML Frameworks:
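On the "why is this all-reduce slow" question, one useful first step is simply measuring achieved bus bandwidth and comparing it against what the fabric should deliver. Below is a minimal sketch using PyTorch's NCCL backend; the payload size, iteration counts, and launch command are illustrative assumptions.

```python
"""Minimal all-reduce bandwidth probe (illustrative sketch, not a benchmark suite).

Launch with e.g.: torchrun --nproc_per_node=8 allreduce_probe.py
"""
import os
import time
import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group("nccl")                  # NCCL backend for GPU collectives
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    n_bytes = 1 << 30                                # 1 GiB payload per rank
    x = torch.ones(n_bytes // 4, dtype=torch.float32, device="cuda")

    for _ in range(5):                               # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters

    # Ring all-reduce moves roughly 2*(world-1)/world of the payload per rank.
    bus_bw = (n_bytes * 2 * (world - 1) / world) / elapsed / 1e9
    if rank == 0:
        print(f"avg all-reduce: {elapsed*1e3:.1f} ms, ~{bus_bw:.1f} GB/s bus bandwidth")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If the measured number sits far below the fabric's expected line rate, the problem usually lives below the framework: a degraded link, a misconfigured NIC, or NCCL falling back to a slower transport.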
Interview Prep Guide
Preparation Strategy
To prepare for this role, review GPU cluster architecture, distributed systems, and performance-optimization techniques. Practice designing complex systems and discussing their trade-offs. Revisit your own experience with GPU infrastructure, customer technical partnerships, and performance tuning, and be ready to walk through specific examples and scenarios. Resources such as NVIDIA's GPU documentation, distributed-systems literature, and system design blogs can help fill gaps.
Likely Interview Rounds
- 1. Technical (~60 min)
What to prep: Review GPU cluster architecture, distributed training, and performance optimization techniques. Be prepared to discuss your experience with GPU infrastructure and large-scale training workloads.
- How do you design a GPU cluster for large-scale distributed training?
- What are the common failure modes of distributed training and how do you mitigate them?
- How do you optimize the performance of a GPU cluster for training throughput and cost efficiency?
- 2. System Design (~90 min)
What to prep: Review system design principles, GPU cluster architecture, and distributed systems. Practice designing and discussing complex systems and trade-offs.
- Design a system for multi-provider, multi-region GPU compute clusters
- How would you implement topology-aware scheduling, networking, and storage decisions for a large-scale GPU cluster? (a toy placement sketch follows the rounds list)
- How do you ensure the reliability and performance of a GPU cluster in a production environment?
- 3. Behavioral (~60 min)
What to prep: Review your past experiences with GPU infrastructure, customer technical partnerships, and performance optimization. Prepare to discuss specific examples and scenarios.
- Tell me about a time when you had to troubleshoot a complex issue with a GPU cluster
- How do you handle customer technical partnerships and onboard new customers?
- Can you describe a situation where you had to optimize the performance of a GPU cluster for a specific workload?
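For the topology-aware scheduling question above, it helps to have a concrete toy model in mind. The sketch below is purely illustrative (the node and leaf-switch names are invented): it packs a job under as few leaf switches as possible, which is the basic intuition behind keeping collective traffic off the spine.

```python
"""Toy topology-aware placement sketch: prefer nodes under one leaf switch.

Purely illustrative; the topology map and node names are invented for the example.
"""
from collections import defaultdict

# Hypothetical inventory: node -> leaf switch it hangs off.
NODE_TO_LEAF = {
    "node-01": "leaf-a", "node-02": "leaf-a", "node-03": "leaf-a",
    "node-04": "leaf-b", "node-05": "leaf-b",
    "node-06": "leaf-c",
}


def place(job_nodes: int, free_nodes: set[str]) -> list[str]:
    """Pick job_nodes free nodes, spanning as few leaf switches as possible."""
    by_leaf = defaultdict(list)
    for node in free_nodes:
        by_leaf[NODE_TO_LEAF[node]].append(node)

    # Greedily take the fullest leaves first so the job spans few switches.
    chosen: list[str] = []
    for leaf in sorted(by_leaf, key=lambda l: len(by_leaf[l]), reverse=True):
        for node in sorted(by_leaf[leaf]):
            if len(chosen) == job_nodes:
                return chosen
            chosen.append(node)
    raise RuntimeError("not enough free nodes for the job")


if __name__ == "__main__":
    print(place(3, {"node-01", "node-02", "node-04", "node-05", "node-06"}))
    # -> two nodes under one leaf plus one more, rather than three separate leaves
```

Real schedulers also weigh fragmentation, reservations, and preemption; the point of the toy is only that placement decisions should be made with the fabric topology in view.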
Most Likely Questions
- What are the challenges of running large-scale distributed training workloads on GPU infrastructure?
- How do you ensure the reliability and performance of a GPU cluster in a production environment? (a worked error-budget sketch follows this list)
- Can you describe your experience with GPU cluster architecture and distributed systems?
- How do you optimize the performance of a GPU cluster for training throughput and cost efficiency?
- What are some common failure modes of distributed training and how do you mitigate them?
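For the reliability questions, it is worth being able to turn an SLO into concrete numbers on the spot. A minimal worked example, using an assumed 99.5% monthly availability target (not a stated Andromeda SLO):

```python
"""Worked error-budget arithmetic for an illustrative 99.5% monthly SLO."""
SLO = 0.995                      # assumed availability target, for illustration only
MINUTES_PER_MONTH = 30 * 24 * 60

error_budget_min = (1 - SLO) * MINUTES_PER_MONTH
print(f"Monthly error budget: {error_budget_min:.0f} minutes")   # -> 216 minutes
```

The harder part for GPU fleets is deciding what counts against that budget: a node with recurring ECC errors or a degraded NVLink can look "up" while quietly burning training throughput, so the SLO definition has to capture job-level health, not just node uptime.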
Common Pitfalls
- Lack of experience with GPU infrastructure and large-scale distributed training
- Inadequate understanding of system design principles and distributed systems
- Insufficient knowledge of performance optimization techniques for GPU clusters
- Poor communication skills and inability to handle customer technical partnerships
Free Prep Resources
- NVIDIA's GPU documentation
- Distributed systems literature (e.g. 'Designing Data-Intensive Applications' by Martin Kleppmann)
- System Design Primer (GitHub: donnemartin)
- Kubernetes documentation