Monarch now supports running distributed training workloads on Kubernetes clusters. The new KubernetesJob API connects to pre-provisioned GPU pods managed by the https://github.com/meta-pytorch/monarch-kubernetes/ repository, enabling seamless multi-node DDP training on Kubernetes.
Key Capabilities:

- Connect to Kubernetes pods using KubernetesJob
- Provision GPU workers via the MonarchMesh Custom Resource Definition

See the full tutorial: https://meta-pytorch.org/monarch/generated/examples/ddp/kubernetes_ddp.html
We also publish Docker images; see https://github.com/meta-pytorch/monarch/pkgs/container/monarch
SPMDJob (monarch.spmd and monarch.job.spmd)
The new monarch.job.spmd module provides serve() and run_spmd() for an interactive SPMD development workflow:

- Reserve once, iterate many times: Allocate hosts once, then call run_spmd() repeatedly without reprovisioning
- Remote debugging: Add breakpoint() in your training script and attach with monarch debug
- Job caching: Reload cached job state and re-run on the same reserved hosts
Example:

```python
from monarch.job import job_load  # reloads cached job state
from monarch.job.spmd import serve

job = serve(
    ["torchrun", "--nproc-per-node=4", "--standalone", "train.py"],
    scheduler="local_cwd",
)
job.run_spmd()

# Later, reload and re-run without reprovisioning:
job = job_load(".monarch/job_state.pkl")
job.run_spmd()
```
This supports single-node training with command lists, and multi-node training with a TorchX AppDef on schedulers such as Slurm.
See the example: https://meta-pytorch.org/monarch/generated/examples/ddp/spmd_job.html
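The reserve-once, iterate-many workflow hinges on persisting job state to disk and reloading it in a later session. The caching idea can be sketched in plain Python (FakeJob, job_save, and job_load below are illustrative stand-ins, not Monarch's implementation):

```python
import os
import pickle
import tempfile

class FakeJob:
    """Toy stand-in for a reserved job: remembers allocated hosts."""
    def __init__(self, hosts):
        self.hosts = hosts
        self.runs = 0

    def run_spmd(self):
        # Re-running uses the already-reserved hosts; no reprovisioning.
        self.runs += 1
        return f"ran on {len(self.hosts)} host(s)"

def job_save(job, path):
    with open(path, "wb") as f:
        pickle.dump(job, f)

def job_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "job_state.pkl")
job = FakeJob(hosts=["node0"])
job.run_spmd()
job_save(job, path)

# A later session reloads the cached state and re-runs
# against the same reservation:
job2 = job_load(path)
print(job2.run_spmd())  # ran on 1 host(s)
```

The real API persists scheduler handles rather than a toy object, but the save/load shape is the same.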
Experimental Queue Dispatch Mode (Performance)
A new actor dispatch mode where Rust enqueues messages to a channel for Python to process, rather than Rust acquiring the GIL directly. This can improve throughput for message-heavy workloads.
```python
from monarch.config import configure

configure(actor_queue_dispatch=True)
```
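The two dispatch modes differ in who drives Python execution per message. The queue-based shape can be sketched with a plain Python queue and worker thread (this is a toy model of the pattern, not Monarch's internals):

```python
import queue
import threading

msgs = queue.Queue()
results = []

def python_worker():
    # Python drains the queue on its own thread, so the producer
    # only enqueues and never blocks on per-message Python work.
    while True:
        msg = msgs.get()
        if msg is None:  # sentinel: shut down
            break
        results.append(msg.upper())

t = threading.Thread(target=python_worker)
t.start()
for m in ["ping", "pong"]:
    msgs.put(m)  # the producer's role (Rust, in Monarch's case)
msgs.put(None)
t.join()
print(results)  # ['PING', 'PONG']
```

Batching message handling on one consumer thread is what can raise throughput for message-heavy workloads.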
Real this_proc() for Local Spawning
The this_proc() function returns a handle to the current singleton process, enabling actors to spawn other actors locally. Remote actors can use this_proc() to spawn actors on their own host, enabling patterns such as handing out a reference to a local proc and having remote actors spawn resources on it.
```python
from monarch.actor import Actor, endpoint, this_proc

class HelperActor(Actor):
    pass

class ManagerActor(Actor):
    @endpoint
    def spawn_helper(self) -> HelperActor:
        # Spawns HelperActor in the same process as ManagerActor
        return this_proc().spawn("helper", HelperActor)
```
Zero-Copy Messaging Path from Python
A new Buffer class enables zero-copy message serialization from Python. Large writes (≥256 bytes) are stored as references to Python bytes objects rather than being copied, integrating with multipart serialization for efficient vectored I/O.
```python
from monarch._rust_bindings.monarch_hyperactor.buffers import Buffer
from monarch.config import configure

buffer = Buffer()
buffer.write(b"small")     # copied into pending buffer
buffer.write(b"x" * 1000)  # stored as zero-copy reference

# Configure the threshold via:
configure(small_write_threshold=256)  # default
```
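The copy-small, reference-large split can be modeled in a few lines of plain Python. The sketch below (SketchBuffer is illustrative, not Monarch's Buffer) copies writes under the threshold into a pending bytearray and keeps larger writes as references, preserving write order for vectored I/O:

```python
SMALL_WRITE_THRESHOLD = 256  # mirrors the default above

class SketchBuffer:
    """Toy model of copy-small / reference-large buffering."""
    def __init__(self):
        self.pending = bytearray()  # small writes copied here
        self.parts = []             # ordered segments for vectored I/O

    def _flush_pending(self):
        if self.pending:
            self.parts.append(bytes(self.pending))
            self.pending = bytearray()

    def write(self, data: bytes):
        if len(data) < SMALL_WRITE_THRESHOLD:
            self.pending += data      # copied
        else:
            self._flush_pending()     # keep ordering intact
            self.parts.append(data)   # stored by reference, no copy

    def segments(self):
        self._flush_pending()
        return self.parts

buf = SketchBuffer()
buf.write(b"small")
big = b"x" * 1000
buf.write(big)
segs = buf.segments()
print(len(segs))       # 2
print(segs[1] is big)  # True: the large write was referenced, not copied
```

The payoff is that a 1 KB payload is never memcpy'd into the buffer; the serializer hands the original bytes object straight to the I/O layer.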
Principles of Ownership in Supervision
This release improves the supervision model for error handling in meshes, built on four core principles:
- Owned meshes: Creating new meshes always results in an owned mesh
- Single ownership: All meshes are owned by at most one actor (no transfer or suspension)
- Lifecycle binding: A mesh cannot outlive its owner: when the owner dies, so does the mesh
- Graceful cleanup: Stopped meshes drain pending messages before cleanup; owned meshes clean up before their owner
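The principles above can be illustrated with a small plain-Python model (Mesh and Owner below are toy classes, not Monarch's API):

```python
class Mesh:
    def __init__(self, owner):
        self.owner = owner   # single owner, fixed at creation
        self.alive = True
        self.pending = []

    def stop(self):
        # Graceful cleanup: drain pending messages before teardown
        while self.pending:
            self.pending.pop(0)
        self.alive = False

class Owner:
    def __init__(self):
        self.meshes = []

    def create_mesh(self):
        m = Mesh(owner=self)  # creating a mesh yields an owned mesh
        self.meshes.append(m)
        return m

    def die(self):
        # Lifecycle binding: owned meshes stop before their owner is gone
        for m in self.meshes:
            m.stop()

o = Owner()
m = o.create_mesh()
m.pending.append("msg")
o.die()
print(m.alive)  # False: the mesh cannot outlive its owner
```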
Actors can now implement __supervise__ to handle failures from owned meshes.
Example:

```python
import logging

from monarch.actor import Actor

class ManagerActor(Actor):
    def __supervise__(self, failure: "MeshFailure") -> bool:
        # `failure` describes which owned mesh failed and why
        logging.error(f"failure encountered: {failure}")
        # Return truthy to handle, falsey to propagate
        return None
```
See the documentation: https://meta-pytorch.org/monarch/actors.html#error-handling-in-meshes
SkyPilot Integration (Community Contribution)
SkyPilotJob enables running Monarch on Kubernetes and cloud VMs across 20+ cloud providers (AWS, GCP, Azure, CoreWeave, Nebius, etc.) via https://skypilot.readthedocs.io/.
```python
import sky
from monarch_skypilot import SkyPilotJob

job = SkyPilotJob(
    meshes={"trainers": 2},
    resources=sky.Resources(accelerators="A100:1"),
    cluster_name="my-monarch-cluster",
)
state = job.state()
trainers = state.trainers  # HostMesh with 2 nodes
```