Changelog
Tensors and Dynamic neural networks in Python with strong GPU acceleration
| Unstable |
| --- |
| torch::stable::Tensor |
| High-performance quantized LLM inference on Intel CPUs with native PyTorch |
| Experimental Wheel Variant Support |
| Inductor CUTLASS backend support |
| Inductor Graph Partition for CUDAGraph |
| Control Flow Operator Library |
| HuggingFace SafeTensors support in PyTorch Distributed Checkpointing |
| SYCL support in PyTorch CPP Extension API |
| A16W4 on XPU Device |
| Hierarchical compilation with torch.compile |
| Intel GPU distributed backend (XCCL) support |
For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.
Due to a bug introduced in CUDA 12.9.1, we are unable to complete full Windows wheel builds with this
version, as compilation of torch.segment_reduce() crashes the build. Thus, we provide a wheel
without torch.segment_reduce() included in order to sidestep the issue. If you need support
for torch.segment_reduce(), please utilize a different version.
Due to binary size limitations, support for sm50 - sm60 architectures with CUDA 12.8 and 12.9 has been dropped for the 2.8.0 release. If you need support for these architectures, please utilize CUDA 12.6 instead.
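To check what a given wheel was built with, a minimal sketch (not part of the official notes) that prints the CUDA toolkit version and the SM architectures compiled into the build:
import torch

# CUDA toolkit the wheel was built against (None for CPU-only builds).
print(torch.version.cuda)
# SM architectures the CUDA kernels were compiled for, e.g. ['sm_70', 'sm_80', ...].
if torch.cuda.is_available():
    print(torch.cuda.get_arch_list())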
Unsupported operations (such as torch.nn.Hardshrink on integer tensors, shown below) now raise NotImplementedError instead of RuntimeError (#155470). Please update exception handling logic to reflect this.
In 2.7.0
import torch
try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except RuntimeError:
    ...
In 2.8.0
import torch
try:
    torch.nn.Hardshrink()(torch.randint(0, 5, (10,)))
except NotImplementedError:
    ...
Custom autograd.Function now errors when mutating a view of a leaf tensor (#153094). In 2.8.0, if a custom autograd.Function mutates a view of a leaf requiring grad, it properly raises an error. Previously, it would silently leak memory.
import torch

class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        inp.add_(1)
        ctx.mark_dirty(inp)
        return inp

    @staticmethod
    def backward(ctx, gO):
        pass

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.view_as(a)
Func.apply(b)
Output:
Version 2.7.0
Runs without error, but leaks memory
Version 2.8.0
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation
torch.tensordot now raises an error when called with an out tensor that has requires_grad=True (#150270). Please avoid passing an out tensor with requires_grad=True, as gradients cannot be computed for this tensor.
In 2.7.0
import torch
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
# does not error, but gradients for c cannot be computed
torch.tensordot(a, b, dims=([1], [0]), out=c)
In 2.8.0
import torch
a = torch.empty((4, 2), requires_grad=True)
b = torch.empty((2, 4), requires_grad=True)
c = torch.empty((2, 2), requires_grad=True)
torch.tensordot(a, b, dims=([1], [0]), out=c)
# RuntimeError: tensordot(): the 'out' tensor was specified and requires gradients, and
# its shape does not match the expected result. Either remove the 'out' argument, ensure
# it does not require gradients, or make sure its shape matches the expected output.
Specialization of tensors marked with mark_dynamic now correctly errors (#152661). Prior to 2.8, it was possible for a guard on a symbolic shape to be incorrectly
omitted if the symbolic shape evaluation was previously tested with guards
suppressed (this often happens within the compiler itself). This has been fixed
in 2.8 and usually will just silently "do the right thing" and add the correct
guard. However, if the new guard causes a tensor marked with mark_dynamic to become
specialized, this can result in an error. One workaround is to use
maybe_mark_dynamic instead of mark_dynamic.
See the discussion in issue #157921 for more context.
Version 2.7.0
import torch
embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.mark_dynamic(x, 0)
@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()
f(embed, x)
Version 2.8.0
import torch
embed = torch.randn(2, 8192)
x = torch.zeros(8192)
torch._dynamo.maybe_mark_dynamic(x, 0)
@torch.compile
def f(embedding_indices, x):
    added_tokens_mask = torch.where(x > 10000, 1, 0)
    ei = torch.narrow(embedding_indices, 1, 0, x.size(0))
    return ei.clone()
f(embed, x)
Several torch.compile configuration options have been renamed or removed (a short sketch of some new spellings follows the cond example below):
- enable_cpp_framelocals_guard_eval no longer has any effect (#151008).
- rocm.n_max_profiling_configs is deprecated (#152341). Instead, use the ck-tile based configs rocm.ck_max_profiling_configs and rocm.ck_tile_max_profiling_configs.
- autotune_fallback_to_aten is deprecated (#154331). Inductor will no longer silently fall back to ATen. Please add "ATEN" to max_autotune_gemm_backends for the old behavior.
- use_mixed_mm and mixed_mm_choice are deprecated (#152071). Inductor now supports prologue fusion, so these special cases are no longer needed.
- descriptive_names = False is deprecated (#151481). Please use one of the other available options: "torch", "original_aten", or "inductor_node".
- custom_op_default_layout_constraint has moved from the Inductor config to the functorch config (#148104). Please reference it via torch._functorch.config.custom_op_default_layout_constraint instead of torch._inductor.config.custom_op_default_layout_constraint.
- emit_current_arch_binary is deprecated (#155768).
- aot_inductor.embed_cubin has been renamed to aot_inductor.embed_kernel_binary (#154412).
- aot_inductor.compile_wrapper_with_O0 has been renamed to compile_wrapper_opt_level (#148714).

Alias/mutation checks have been added to HigherOrderOperators (e.g. cond), which will explicitly error out if alias/mutation among inputs and outputs is unsupported (#148953, #146658). For affected HigherOrderOperators, add .clone() to aliased outputs to address this.
Version 2.7.0
import torch
@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x, lambda x: x + 1, [x])
fn(torch.ones(3))
Version 2.8.0
import torch
@torch.compile(backend="eager")
def fn(x):
    return torch.cond(x.sum() > 0, lambda x: x.clone(), lambda x: x + 1, [x])
fn(torch.ones(3))
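As referenced above, a short sketch (not an official recipe) of the new config spellings: re-enabling ATen as a max-autotune GEMM backend now that autotune_fallback_to_aten is deprecated, and reading the relocated custom-op layout constraint.
import torch

# Opt back into ATen fallbacks for max-autotune GEMMs.
torch._inductor.config.max_autotune_gemm_backends = "ATEN,TRITON"

# The custom-op layout constraint now lives in the functorch config.
print(torch._functorch.config.custom_op_default_layout_constraint)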
guard_or_x and definitely_x have been consolidated (#152463). We removed definitely_true / definitely_false and associated APIs, replacing them with
guard_or_true / guard_or_false, which offer similar functionality and can be used to
achieve the same effect. Please migrate to the latter.
Version 2.7.0
from torch.fx.experimental.symbolic_shapes import definitely_false, definitely_true
...
if definitely_true(x):
    ...
if definitely_false(y):
    ...
Version 2.8.0
from torch.fx.experimental.symbolic_shapes import guard_or_false, guard_or_true
...
if guard_or_false(x):
    ...
# alternatively: if guard_or_false(torch.sym_not(y))
if not guard_or_true(y):
    ...
torch.export.export_for_inference has been removed in favor of torch.export.export_for_training().run_decompositions() (#149078).
Version 2.7.0
import torch
...
exported_program = torch.export.export_for_inference(mod, args, kwargs)
Version 2.8.0
import torch
...
exported_program = torch.export.export_for_training(
mod, args, kwargs
).run_decompositions(decomp_table=decomp_table)
The default is now strict=False in torch.export.export and export_for_training (#148790, #150941). This differs from the previous release default of strict=True. To revert to the old default
behavior, please explicitly pass strict=True.
Version 2.7.0
import torch
# default behavior is strict=True
torch.export.export(...)
torch.export.export_for_training(...)
Version 2.8.0
import torch
# strict=True must be explicitly passed to get the old behavior
torch.export.export(..., strict=True)
torch.export.export_for_training(..., strict=True)
The default ONNX opset version for torch.onnx.export is now 18 (#156023). When dynamo=False, the default ONNX opset version has been updated from 17 to 18. Users can set opset_version to explicitly select an opset version.
Version 2.7
# opset_version=17
torch.onnx.export(...)
Version 2.8
# To preserve the original behavior
torch.onnx.export(..., opset_version=17)
# New: opset_version=18
torch.onnx.export(...)
JitTraceConvertStrategy has been removed (#152556). Support for JIT traced and scripted modules in the ONNX exporter when dynamo=True has been removed. You are encouraged to export an nn.Module directly, or to create an ExportedProgram using torch.export before exporting to ONNX.
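For example, a minimal sketch (assuming the dynamo=True exporter accepts an ExportedProgram as the model argument, per the torch.onnx.export docs):
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

example_args = (torch.randn(2, 2),)

# Option 1: export the nn.Module directly with the dynamo=True exporter.
torch.onnx.export(M(), example_args, "model.onnx", dynamo=True)

# Option 2: produce an ExportedProgram first, then convert it to ONNX.
ep = torch.export.export(M(), example_args)
torch.onnx.export(ep, example_args, "model_from_ep.onnx", dynamo=True)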
onnxscript>=0.3.1 is required for the dynamo=True option (#157017). You must upgrade onnxscript to version 0.3.1 or higher for it to be compatible with PyTorch 2.8.
The torch/types.h include has been removed from Dispatcher.h (#149557). This can cause build errors in C++ code that implicitly relies on this include (e.g. very old versions of torchvision).
Note that Dispatcher.h does not belong as an include from torch/types.h and was only present as a
short-term hack to appease torchvision. If you run into torchvision build errors, please
update to a more recent version of torchvision to resolve this.
DLPack support has been upgraded to version 1.0 (#145000). As part of the upgrade, some of the DLDeviceType enum values have been renamed. Please switch
to the new names.
Version 2.7.0
from torch.utils.dlpack import DLDeviceType
d1 = DLDeviceType.kDLGPU
d2 = DLDeviceType.kDLCPUPinned
...
Version 2.8.0
from torch.utils.dlpack import DLDeviceType
d1 = DLDeviceType.kDLCUDA # formerly kDLGPU
d2 = DLDeviceType.kDLCUDAHost # formerly kDLCPUPinned
...
NVTX3 code has moved from cmake/public/cuda.cmake to cmake/Dependencies.cmake (#151583). This is a BC-breaking change for the build system interface. Downstream projects that previously got NVTX3 through cmake/public/cuda.cmake (i.e. calling find_package(TORCH REQUIRED)) will now need to explicitly configure NVTX3 support in the library itself (i.e. use USE_SYSTEM_NVTX=1). The change fixes the broken behavior where downstream projects couldn't find NVTX3 anyway due to the PROJECT_SOURCE_DIR mismatch.
Version 2.7.0:
- With -DUSE_SYSTEM_NVTX, a downstream project would be able to find NVTX3 and torch::nvtx3 via PyTorch's cmake/public/cuda.cmake logic.
- Without -DUSE_SYSTEM_NVTX, a downstream project would encounter build errors with CUDA 12.8 or above.
Version 2.8.0:
- With -DUSE_SYSTEM_NVTX, a downstream project will not be able to find NVTX3 or torch::nvtx3 via PyTorch's cmake/public/cuda.cmake. The downstream project now needs to explicitly find NVTX3 and torch::nvtx3 by implementing the same logic as PyTorch's cmake/Dependencies.cmake.
- Without -DUSE_SYSTEM_NVTX, a downstream project will proceed building without NVTX unless another part of the build process re-enables NVTX.

PyTorch 2.8 is the last release that will support GPU acceleration on macOS Ventura. In the next release (2.9), macOS Sonoma (released in Sept. 2023) or above will be required to use the MPS backend.
torch.ao.quantization is deprecated and will be removed in 2.10 (#153892). To migrate (a minimal eager-mode sketch follows this list):
- Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic): use torchao eager mode quantize_ or torchao PT2E quantization.
- FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx, torch.ao.quantization.quantize_fx.convert_fx): use torchao PT2E quantization (torchao.quantization.quantize_pt2e.prepare_pt2e, torchao.quantization.quantize_pt2e.convert_pt2e).

Note that PT2E quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e). See https://github.com/pytorch/ao/issues/2259 and https://docs.pytorch.org/ao/main/quick_start.html#pytorch-2-export-quantization for more details.
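As a rough illustration of the eager-mode migration, a sketch assuming torchao's quantize_ / int8_weight_only APIs from the torchao quick start (torchao is a separate package and its config names may differ between releases; check the torchao docs):
import torch
# pip install torchao; names below follow the torchao quick start and are an
# assumption for illustration, not part of these release notes.
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()

# In-place weight-only int8 quantization, replacing the deprecated
# torch.ao.quantization eager-mode entry points.
quantize_(model, int8_weight_only())
out = model(torch.randn(1, 64))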
The dynamo=False (current default) option for torch.onnx.export is deprecated (#152478, #155580). The default will be dynamo=True starting from PyTorch 2.9. You are encouraged to migrate to the dynamo=True option in torch.onnx.export. This flag makes torch.export.export the default export path, replacing TorchScript.
To maintain the old behavior, set dynamo=False explicitly. You are encouraged to also experiment with the fallback=True option that will make the exporter fall back to the dynamo=False path if there are errors.
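For instance, a minimal sketch of the migration (fallback=True is optional and asks the exporter to retry the legacy path if the dynamo=True path fails):
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

# New export path; fallback=True retries the legacy (TorchScript-based) exporter on failure.
onnx_program = torch.onnx.export(
    M(), (torch.randn(2, 3),), dynamo=True, fallback=True
)
onnx_program.save("model.onnx")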
Other new features in the compiler, export, AOTInductor, MPS, and ONNX areas include:
- nested_compile_region (#156449)
- guard_filter_fn (#150936)
- dont_skip_tracing decorator to skip over most Dynamo skipfiles rules (#150586)
- draft-export, an export variant designed to consistently produce a graph and generate a debugging report of issues encountered during tracing (#152637, #153219, #149465, #153627, #154190, #155744, #150876, #150948, #151051, #151065, #150809, #151797)
- TorchBind objects (#150196, #154265)
- aot_inductor.model_name_for_generated_files for specifying model name (#154129)
- MPSInductor: torch.compile for Apple GPUs (#150121, #149342, #151449, #151754, #149687, #149180, #149221, #153598, #152788, #153787, #152214, #151152, #155891, #154578, #151272, #151288, #153997, #151871, #153362, #156566, #150661, #153582) (see the sketch after this list)
- Added new strategy draft_export (#147529, docs) to provide debugging information upon data-dependent / constraint errors when obtaining an ExportedProgram with torch.onnx.export
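As a small illustration of the MPSInductor item above, a sketch assuming an Apple-silicon machine with MPS available:
import torch

if torch.backends.mps.is_available():
    model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU()).to("mps")
    compiled = torch.compile(model)  # Inductor generates Metal kernels for the MPS backend
    out = compiled(torch.randn(4, 16, device="mps"))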
Added support for symbolic operators in the dynamo=True export path (#148905, #149678, #150038, docs). Two operators torch.onnx.ops.symbolic and torch.onnx.ops.symbolic_multi_out are defined to allow you to create symbolic ONNX operators directly in your PyTorch models. You can use them in a forward method:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    # Optionally use is_in_onnx_export to control the behavior during onnx export
    if torch.onnx.is_in_onnx_export():
        # Create a symbolic ONNX operator with the name "CustomOp" in the "custom_domain" domain.
        # The output tensor will have the specified dtype and shape
        return torch.onnx.ops.symbolic(
            "custom_domain::CustomOp",
            (x,),
            dict(attr_key="attr_value"),
            dtype=x.dtype,
            shape=x.shape,
            version=1,
        )
    else:
        return x
Additional features and improvements cover new dtypes, C++/CUDA extensions, and torch.distributed:
- torch.float4_e2m1fn_x2 dtype (#148791)
- TORCH_CUDA_ARCH_LIST (#152715, #155314)
- bicubic mode for torch::nn::functional::grid_sample (#150817)
- no_implicit_headers mode for load_inline() on custom CUDA extensions (#149480)
- TCPStore with clone and queuing features (#150966, #151045, #150969, #151485)
- getDefaultBackend more fault tolerant without relying on exceptions (#149152)
- masterListenFd in TCPStoreLibUvBackend (#150215)
- TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK (#150682)
- global_rank when group_rank is used (#151373)
- ProcessGroupNCCL via an unsafe API (#152496)
- needs_contiguous_strides tag in functional collective (#153399, #153523)
- split_group to work with non-nccl backends (#152175)
- new_subgroups() by using new_subgroups_by_enumeration() (#153843)
- ProcessGroupNCCL (#153990)
- c10::Half for gloo (#153862)
- get_process_group_ranks() to accept group=None (#154902) (see the sketch after this list)
- init_process_group support index-only device id (#156214)
- ProcessGroup (#151723)
- reduce_scatter and ReduceOp::AVG in ProcessGroupGloo (#149781, #149869)
- ProcessGroupNCCL (#152706)
- ibverbs backend in gloo and enabled gloo CUDA when used with a backend that supports GPUDirect (#153015, #153425, #153406)
- use_python_reducer to C++ reducer (#152735)
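As a small illustration of the get_process_group_ranks(group=None) item above, a single-process sketch (the gloo backend and localhost rendezvous are chosen only for the example):
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# group=None now refers to the default process group.
print(dist.get_process_group_ranks(None))  # -> [0]

dist.destroy_process_group()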
Further changes span distributed training and checkpointing, torch.compile and Inductor configs, export, FX, operator coverage, ONNX, optim, quantization, accelerator backends, bug fixes, performance, and documentation:
- DistributedStateDict (DSD): write_size in planner write items (#149699)
- StridedShard support for uneven sharding (#150490)
- torch.cumsum (#151071)
- DTensor redistribute fwd/bwd datatype conversion to enable SimpleFSDP mixed precision training (#150740)
- torch.distributed.tensor.debug.visualize_sharding (#152027)
- PrivateUse1 backend in FSDP collectives and device type to pre forward hook (#147260, #149487)
- set_reshard_after_forward (#149103)
- reshard_after_forward=True for root model and kept root unsharded when not specifying reshard_after_forward (#154704, #155319)
- all_reduce_event only if it's not CPU device (#150316)
- get_pipeline_order() for Gpipe and 1F1B (#155935)
- ShardedTensor and recalculated metadata from all_gather (#152583)
- ParallelStyle PrepareModuleInputOutput (#150372)
- __torch_function__, and namedtuple subclasses (#153150, #149792, #153982)
- reason field to torch.compiler.disable (#150341)
- lru_cache warnings for functions in the top-level torch namespace (#157718)
- aot_inductor.custom_ops_to_c_shims and aot_inductor.custom_op_libs: allow for specifying custom op C shim (#153968)
- max_fusion_buffer_group_pairwise_attempts: limits fusions to specified node distance (#154688)
- cuda.cutlass_enabled_ops: controls CUTLASS operation selection (#155770)
- triton.cudagraph_capture_sizes: allows specifying certain shapes for which to capture CUDAGraphs; skips CUDAGraphs for other shapes (#156551)
- use_static_cuda_launcher: enables launching compiled triton statically to improve cold start times (#148890)
- assume_unaligned_fallback_output: allows inductor to track unaligned outputs (#150777)
- cuda.cutlass_tma_only: controls whether or not to only use TMA-compatible kernels in CUTLASS (#152815)
- static_launch_user_defined_triton_kernels: enables statically launching user defined triton kernels (#153725)
- precompilation_timeout_seconds: controls the timeout on precompilation (#153788)
- disable_decompose_k: disables new DecomposeK GEMM kernels (#154421)
- min_num_split: sets the minimum number of splits in a split reduction (#155941)
- max_autotune_flex_search_space: allows specifying the size of the search space for flex attention autotuning (#156307)
- LOG_AUTOTUNE_RESULTS for autotune log (#156254)
- min, max, math.pow (#151348)
- pytree.register_dataclass (#147752)
- jit.scripted functions in export (#155180)
- num_runners to AOTIModelPackageLoader (#149364)
- == (#150611)
- normalize_function (#143689)
- graph_code_verbose_log artifact for FX passes (#153775)
- fx.passes.split_module to normalize input names (#157793)
- cross (#154999)
- torch.special operations as well as index_copy, hardshrink, rsub, col2im, and isin (#149174, #149203, #149123, #149368, #149378, #149563, #149687, #149705, #149783, #149407/#149680, #150279, #151754, #153786, #154326, #155304, #156263, #155382, #154010, #149816, #152282, #156090, #150060, #151600, #155002, #154671)
- index_put with half precision floats (#151869)
- ConvTranspose3D with FP32 and complex (#154696)
- log1p and sigmoid with int64 (#151791)
- weight_norm on CPU (#148878)
- dynamo=True (#149901, #154596)
- Attention-23 and RotaryEmbedding-23 as native PyTorch ops (#156431, #156367, #154745)
- torch.scan (#154513)
- group_norm support from opset 21 (#152138)
- asdict method to VerificationInfo class (#151024)
- dynamic_shapes behavior to use torch.export.dim.DYNAMIC (#153065)
- sym_float, sym_not, sym_min, sym_max (#153200, #152111, #152196)
- TensorLR variant for fused Adagrad on CPU (#153078)
- lr_lambda type check in MultiplicativeLR (#151973)
- torch.AcceleratorError (#152023)
- Size.__radd__() (#152554)
- get_default_device() to also respect the torch.device context manager (#148621) (see the sketch at the end of these notes)
- mul / add / add_relu and batch_norm2d, qconv1d-relu fusion, and lowering pass (#151112, #152411, #152811, #150751, #149708)
- torch.fused_moving_avg_obs_fake_quant on CUDA (#153699)
- cpp_extension (#152432)
- mm/bmm/addmm (#153262)
- PrivateUse1 extension (#149374)
- torch.Tensor.scatter_add_ (#150543), torch.matrix_exp (#155202)
- embed_cubin and multi_arch_kernel_binary options in AOTI for Intel GPU (#154514, #153924)
- UserDefineClass (#155787)
- CMake-4.x (#150203)
- gcc-12+ (#150847)
- /permissive- flag (#149035)
- torch.norm for scalar input (#144073)
- log_softmax reduced-precision fp kernel (#156379)
- torch.backends.cuda.matmul.allow_fp16_accumulation crash when using cuBLASLt (#153083)
- AsyncMM on Blackwell (#153519)
- torch.cuda.MemPool for multithreaded use-cases (#153356)
- sum() on a default-constructed gamma / beta in layer_norm (#156600)
- empty_cache under mempool context (#158180)
- all_to_all (#149485)
- group input argument in new_subgroups() (#152765, #153798)
- broadcast_object util function (#155912)
- DDPOptimizer issue on static tensor index (#155746)
- local_map with multi-threading (#149070)
- new_local_tensor in redistribute be None case (#152303)
- TensorPipe (#154382)
- gather when a local tensor on certain ranks has zero elements (#150914)
- dict(mapping_proxy), and the FlexAttention HOP (#157754, #157515, #157519)
- lru_cache method (#158689, #157308)
- TORCH_LOGS argument is passed (#151678)
- aten.is_nonzero (#149637), torch.bincount() (#152497), aten.div (#150874), slicing (#150104), and attn_mask (#158618), aten.to (#153972), scalar tensor construction (#154661)
- dynamic_shapes spec for kwargs (#148772, #149528, #150103)
- functools.partial (#153408), and higher order ops (#149295)
- None inputs (#150515), math module (#154643), call_torchbind (#155647), and enums (#154821)
- update_constant_buffer issue (#149243)
- model_package_loader (#152334)
- AOTIModel if they don't exist (#152692)
- ConstantFolding (#153152)
- dot and gemv (#152676)
- torch.lobpcg to compute same largest eigenvalue as scipy and np.linalg.eig (#152789)
- ReducedPrecisionGemV (#150949)
- 2**32+ element inputs, binary ops with inputs with different dtypes, ops with complex scalar inputs, cholesky decomp, floor_divide type promotion, index_kernel with large inputs, lerp with complex inputs, logit with half/bfloat16 inputs, SDPA memory leak, torch.special.entr, tri[ul], matrix inversion with N>1024, and where with non-contiguous cond (#152479, #155183, #149233, #151176, #151282, #158239, #152371, #149974, #158237, #146754, #158867, #155184, #152204)
- load_state_dict behavior for nn.LazyLinear (#147599)
- onnx_program callable (#151121)
- lr_scheduler unexpectedly calls step() when init argument last_epoch > -1 (#149312)
- CosineAnnealingWarmRestarts resetting T_cur (#151289)
- MixtureSameFamily distribution (#151317)
- Wishart or Uniform distribution modifies constraints on the first (#154361)
- torch::utils::tensor_to_numpy symbol (#154178)
- torch.[con]cat[enate] to avoid crashing on empty inputs (#155460)
- torch.tensor and torch.ops.aten.scalar_tensor behavior (#158655)
- ScaledGEMM (#149677)
- ScaledGEMM (#152403)
- torch.is_vulkan_available() on Mac (#155595)
- offset > 0 (#154495)
- torch.xpu.is_bf16_supported to correctly report presence of Intel GPU (#152317)
- ELU(0) with the cheaper definition (#155765)
- cat and index_select (#150233, #152380, #151715)
- SubsetRandomSampler by iterating over list instead of tensor (#149126)
- cpp.use_small_dequant_buffer to use a small dequant buffer for WOQ int4 GEMM (#156395)
- torch.dot with float16/bfloat16 (#152799)
- LayerNorm, mm / bmm, sum / prod reductions, arithmetic ops, binary kernels, SDPA, linear, and cumsum / cumprod (#152010, #150541, #150566, #147644, #149730, #152781, #152210, #157494)
- torch.tensordot when contracting to a scalar (#145936)
- softmax, NLLLoss, in-place sum, max pooling backward / reductions on NHWC inputs, max pooling, multi-dimensional reductions, and non-vectorized elementwise kernels (#149076, #149779, #149548, #151230, #152267, #154522, #154619, #155806, #153184)
- HipSparseLT to further accelerate semi-structured (e.g. 2:4) sparsity (#150578)
- addmm, baddmm to reduce oneDNN integration overhead on Intel GPU (#153051)
- ctx.save_for_backward is important in note about extending autograd (#153005)
- torch.autograd.graph.saved_tensors_hooks to avoid refcycle (#153049)
- torch.amin and torch.amax (#155071)
- NCCLConfig with QOS variable (#151821)
- get_default_backend_for_device (#158236)
- ignored_params docstring and added unit tests (#149074)
- Dims and ExportGraphSignature (#156262, #156244)
- torch.linalg.norm()'s ord argument of +2 & -2 (#155148)
- nn.RNN, nn.functional loss functions, interpolate saturate cast behavior, ConvTranspose2d stride / output_size arguments, and register_full_backward_hook (#155123, #153620, #148436, #151304, #150819, #150609, #151785)
- nn.Sequential and nn.LazyModuleMixin (#147304, #150596)
- nn.modules.padding and AvgPoolND (#155618, #152680)
- LRSchedulers (#149189)
- CosineAnnealingLR to accurately reflect its recursive learning rate schedule (#152936)
- Adafactor documentation (#145209)
- load_state_dict hint doc about invoke order work with lr_scheduler (#149942)
- torch.Library's kind has no default value, to be consistent with the code (#149390)
- requires_grad=True in tensor.to() (#150913)
- cdist param description (#151178)
- Example: and not Example:: in docs (#153978)
- as_strided() docs (#149146)
- keepdim param optional description (#151197)
- torch.trapezoid docs (#151190)
- out_dtype arg for torch GEMM operations (#151704)
- torch.min(), torch.max(), torch.all(), and torch.any() (#152658)
- torch.triu_indices, torch.tril_indices dtype description (#150749)
- torch.equal description (#149618)
- get_default_qat_qconfig in prepare_qat_fx docs (#155100)
- nccl_version and thread name/id, for flight record in PGNCCL (#150356, #150513, #151048, #152648, #155142, #155754)
- new_subgroups() for Non-Divisible World Sizes (#154124)
- get_backend() with more details (#141796)
- FlatParamHandle (#151336)
- rpc_init to CPython (#154325)
- torch.distributed.run option to provide destination for event logging (#155268)
- TracingContext (#149294)
- detect_attr_assignment (#151824)
- AOTInductor runtime API for Intel GPU (#153929)
- stable::Tensor is_contiguous API (#156228)
- lr_scheduler.py (#151219)
- step() with default value (#153367)
- setup-python for Mac tests (#155698)
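As a quick sketch of the get_default_device() item above, showing it reflecting the torch.device context manager (the meta device is used only so the example runs without an accelerator):
import torch

print(torch.get_default_device())  # cpu unless changed

with torch.device("meta"):
    # With #148621, the default device reported here follows the context manager.
    print(torch.get_default_device())  # meta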