PyTorch 2.10.0 Release Notes

Highlights
Backwards Incompatible Changes
Deprecations
New Features
Improvements
Bug fixes
Performance
Documentation
Developers
Security

Highlights

Python 3.14 support for torch.compile(). Python 3.14t (freethreaded build) is experimentally supported as well.

Reduced kernel launch overhead with combo-kernels horizontal fusion in torchinductor

A new varlen_attn() op providing support for ragged and packed sequences

Efficient eigenvalue decompositions with DnXgeev

torch.compile() now respects use_deterministic_mode

DebugMode for tracking dispatched calls and debugging numerical divergence - This makes it simpler to track down subtle numerical bugs.

Intel GPUs support: Expand PyTorch support to the latest Panther Lake on Windows and Linux by enabling FP8 (core ops and scaled matmul) and complex MatMul support, and extending SYCL support in the C++ Extension API for Windows custom ops.

For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.

Backwards Incompatible Changes

Dataloader Frontend

Removed unused data_source argument from Sampler (#163134). This is a no-op, unless you have a custom sampler that uses this argument. Please update your custom sampler accordingly.
Removed deprecated imports for torch.utils.data.datapipes.iter.grouping (#163438). from torch.utils.data.datapipes.iter.grouping import SHARDING_PRIORITIES, ShardingFilterIterDataPipe is no longer supported. Please import from torch.utils.data.datapipes.iter.sharding instead.
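
For a custom sampler, the update amounts to no longer accepting or forwarding data_source to the base class. The following is a minimal sketch, assuming a hypothetical EvenIndexSampler that is not part of PyTorch; the import fallback is only there so the snippet can be run without PyTorch installed.

```python
try:
    from torch.utils.data import Sampler
except ImportError:
    # Fallback so the sketch still runs where PyTorch is not installed.
    Sampler = object


class EvenIndexSampler(Sampler):
    """Hypothetical sampler that yields the even indices of a dataset."""

    def __init__(self, num_samples: int) -> None:
        # Store only what the sampler needs. Nothing is forwarded to
        # Sampler.__init__, so the removed data_source argument is not involved.
        self.num_samples = num_samples

    def __iter__(self):
        return iter(range(0, self.num_samples, 2))

    def __len__(self) -> int:
        return (self.num_samples + 1) // 2


sampler = EvenIndexSampler(5)
indices = list(sampler)
```

A sampler written this way does not depend on whether the base class accepts data_source, so it should behave the same before and after this change.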

torch.nn

Remove Nested Jagged Tensor support from nn.attention.flex_attention (#161734)

ONNX

fallback=False is now the default in torch.onnx.export (#162726)
The exporter now uses the dynamo=True option without fallback. This is the recommended way to use the ONNX exporter. To preserve 2.9 behavior, manually set fallback=True in the torch.onnx.export call.
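
As a rough illustration of the opt-in described above, the helper below passes fallback=True explicitly. The helper name and the EXPORT_KWARGS_WITH_FALLBACK constant are hypothetical; only the dynamo and fallback keyword arguments come from the note itself.

```python
# Keyword arguments that opt back into the 2.9 behavior, per the note above.
EXPORT_KWARGS_WITH_FALLBACK = {"dynamo": True, "fallback": True}


def export_with_fallback(model, example_args, path):
    """Export `model` to ONNX with the fallback path explicitly enabled."""
    import torch  # imported lazily so defining this helper does not require PyTorch

    return torch.onnx.export(model, example_args, path, **EXPORT_KWARGS_WITH_FALLBACK)
```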

Release Engineering

Rename pytorch-triton package to triton (#169888)

Deprecations

Distributed

DeviceMesh
- Added a warning for slicing flattened dim from root mesh and types for _get_slice_mesh_layout (#164993)

Slicing a flattened dimension directly from the root device mesh is now deprecated, because this implicit behavior goes against the PyTorch design principle of explicit over implicit.

Version <2.10

import torch
from torch.distributed.device_mesh import init_device_mesh

device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)

mesh_3d["dp", "cp"]._flatten()
mesh_3d["dp_cp"]  # This comes with no warning

Version >=2.10

import torch
from torch.distributed.device_mesh import init_device_mesh

device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)

mesh_3d["dp", "cp"]._flatten()
# Slicing the flattened dim from the root mesh now emits a warning because it
# implicitly changes the state of the original mesh. This behavior will be
# removed in a future release; keep track of the flattened mesh explicitly.
mesh_3d["dp_cp"]
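
One way to do the explicit bookkeeping recommended above is to keep the mesh returned by _flatten() instead of slicing it back out of the root mesh. This is only a sketch: the helper name flatten_and_track is hypothetical, and mesh_3d is assumed to be a DeviceMesh built as in the examples above.

```python
def flatten_and_track(mesh_3d):
    """Flatten the ("dp", "cp") dims once and keep the result for later use.

    `mesh_3d` is assumed to be a DeviceMesh created with
    mesh_dim_names=("dp", "cp", "tp"), as in the examples above.
    """
    # _flatten() returns the flattened mesh; store that handle instead of
    # slicing "dp_cp" back out of the root mesh later.
    dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
    return {"dp_cp": dp_cp_mesh}
```

Callers then look up the flattened mesh in the returned dictionary, which avoids the deprecated slicing path.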

Ahead-Of-Time Inductor (AOTI)

Move from/to to torch::stable::detail (#164956)

JIT

torch.jit is not guaranteed to work in Python 3.14. Deprecation warnings have been added to user-facing torch.jit API (#167669).

torch.jit should be replaced with torch.compile or torch.export.

ONNX

The dynamic_axes option in torch.onnx.export is deprecated (#165769)

Users should supply the dynamic_shapes argument instead. See https://docs.pytorch.org/docs/stable/export.html#expressing-dynamism for more documentation.
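
A minimal sketch of the replacement, assuming a model with a single tensor input whose first dimension is the batch size. The helper name is hypothetical; torch.export.Dim and the dynamic_shapes argument are the pieces described in the linked documentation.

```python
def export_with_dynamic_batch(model, example_input, path):
    """Export a single-input model with a dynamic batch (first) dimension."""
    import torch  # imported lazily so defining this helper does not require PyTorch

    batch = torch.export.Dim("batch")
    # One entry per positional input; {0: batch} marks dim 0 as dynamic.
    dynamic_shapes = ({0: batch},)
    return torch.onnx.export(
        model,
        (example_input,),
        path,
        dynamo=True,
        dynamic_shapes=dynamic_shapes,
    )
```

This plays the role that a dynamic_axes entry for axis 0 of the input would have played with the deprecated option.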

Profiler

Deprecate export_memory_timeline method (#168036)

The export_memory_timeline method in torch.profiler is being deprecated in favor of the newer memory snapshot API (torch.cuda.memory._record_memory_history and torch.cuda.memory._export_memory_snapshot). This change adds the deprecated decorator from typing_extensions and updates the docstring to guide users to the recommended alternative.

New Features

Autograd

Allow setting grad_dtype on leaf tensors (#164751)
Add Default Autograd Fallback for PrivateUse1 in PyTorch (#165315)
Add API to annotate disjoint backward for use with torch.utils.checkpoint.checkpoint (#166536)

Complex Frontend

Add ComplexTensor subclass (#167621)

Composability

Support autograd in torch.cond (#165908)

cuDNN

BFloat16 support added to cuDNN RNN (#164411)
[cuDNN][submodule] Upgrade to cuDNN frontend 1.16.1 (#170591)

Distributed

LocalTensor:
- LocalTensor is a powerful debugging and simulation tool in PyTorch's distributed tensor ecosystem. It allows you to simulate distributed tensor computations across multiple SPMD (Single Program, Multiple Data) ranks on a single process. This is incredibly valuable for: 1) debugging distributed code without spinning up multiple processes; 2) understanding DTensor behavior by inspecting per-rank tensor states; 3) testing DTensor operations with uneven sharding across ranks; 4) rapid prototyping of distributed algorithms. Note that LocalTensor is designed for debugging purposes only. It has significant overhead and is not suitable for production distributed training.
- LocalTensor is a torch.Tensor subclass that internally holds a mapping from rank IDs to local tensor shards. When you perform a PyTorch operation on a LocalTensor, the operation is applied independently to each local shard, mimicking distributed computation; collective operations are simulated locally without actual network communication. LocalTensorMode is the context manager that enables LocalTensor dispatch. It intercepts PyTorch operations and routes them appropriately. The @maybe_run_for_local_tensor decorator is essential for handling rank-specific logic when implementing distributed code.
- To get started with LocalTensor, users import from torch.distributed._local_tensor, initialize a fake process group, and wrap their distributed code in a LocalTensorMode context. Within this context, DTensor operations automatically produce LocalTensors.
- PRs: (#164537, #166595, #168110, #168314)
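
The idea of a rank-to-shard mapping with per-shard operations and locally simulated collectives can be illustrated with a toy model in plain Python. This is not the LocalTensor API: the ToyLocalTensor class and its methods are hypothetical and only mirror the description above.

```python
from typing import Callable, Dict, List


class ToyLocalTensor:
    """Toy stand-in: maps each rank id to that rank's local shard (a list of floats)."""

    def __init__(self, shards: Dict[int, List[float]]) -> None:
        self.shards = shards

    def map(self, op: Callable[[List[float]], List[float]]) -> "ToyLocalTensor":
        # An operation is applied independently to every rank's shard.
        return ToyLocalTensor({rank: op(shard) for rank, shard in self.shards.items()})

    def all_reduce_sum(self) -> "ToyLocalTensor":
        # Simulated collective: every rank ends up with the elementwise sum of
        # all shards, computed in-process with no network communication.
        # Assumes all shards have the same length.
        total = [sum(values) for values in zip(*self.shards.values())]
        return ToyLocalTensor({rank: list(total) for rank in self.shards})


x = ToyLocalTensor({0: [1.0, 2.0], 1: [3.0, 4.0]})
doubled = x.map(lambda shard: [2 * v for v in shard])
reduced = doubled.all_reduce_sum()
```

Inspecting doubled.shards and reduced.shards shows the per-rank state after each step, which is the kind of inspection the real tool is described as enabling for DTensor code.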

Dynamo

torch.compile now fully works in Python 3.14 (#167384)
Add option to error or disable applying side effects (#167239)
Config flag (skip_fwd_side_effects_in_bwd_under_checkpoint) to allow eager and compile activation-checkpointing divergence for side-effects (#165775)
torch._higher_order_ops.print for enabling printing without graph breaks or reordering (#167571)

FX

Added node metadata annotation API
Disable preservation of node metadata when enable=False (#164772)
Annotation should be mapped across submod (#165202)
Annotate bw nodes before eliminate dead code (#165782)
Add logging for debugging annotation (#165797)
Override metadata on regenerated node in functional mode (#166200)
Skip copying custom meta for gradient accumulation nodes; tag with is_gradient_acc=True (#167572)
Add metadata hook for all nodes created in runtime_assert pass (#169497)
Update gm.print_readable to include Annotation (#165397)

Inductor

Add experimental Pallas TorchInductor backend. (#166822)
Add Pallas TPU backend support. (#167774)
Add Flash Attention support to FlexAttention. (#161118)
Add deterministic mode for Inductor compilation. (#163589) (#165950) (#164532)
Enable custom op autotune decompositions and parameter tuning. (#164212) (#167193)
Expose torch.compiler.config.force_disable_caches as a public API. (#166699)

Ahead-Of-Time Inductor (AOTI)

Integrate AOTI as a backend. (#167338)
Add AOTI mingw cross compilation for Windows. (#163188)

MPS

MPS sparse backend is functional (#162349, #162007, #162910, #162885, #163011, #163694, #164961, #165102, #166708)

torch.nn

Add nn.functional.scaled_mm (#164142)
Add nn.functional.scaled_grouped_mm (#165154)
Add nn.attention.varlen_attn (#164502, #164504)
Add nn.functional.grouped_mm (#168298)

ONNX

A new testing module torch.onnx.testing with a testing utility assert_onnx_program (#162495)

Profiler

Add scope for RecordFunctionFast (#162661)

Quantization

Add _scaled_mm_v2 API (#164141)
Add scaled_grouped_mm_v2 and python API (#165154)
Add embedding_bag_byte_prepack_with_rowwise_min_max and embedding_bag_{2/4}bit_prepack_with_rowwise_min_max (#162924)
Add MXFP4 support for _scaled_grouped_mm_v2 via FBGEMM kernels (#166530)

Release Engineering

Enabled auto-revert on PyTorch CI (#163858, #164911, #165459)
Add PEP 517 compliant Python source distribution package to release process (#157815)
Add Pallas CI testing infrastructure with CPU and GPU test (#167143, #167428, #169687, #169494, #169802)

ROCm

Enable grouped GEMM via regular GEMM fallback (#162419)
Enable grouped GEMM via CK (#166334, #167403)
Enable ATen GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Support torch.cuda._compile_kernel (#162510)
Enhanced Windows support
load_inline (#162577)
Enable AOTriton runtime compile (#165538)
AOTriton scaled_dot_product_attention (#162330)
Add gfx1150 gfx1151 to hipblaslt-supported GEMM lists (#164744)

XPU

Support ATen operators scaled_mm and scaled_mm_v2 for Intel GPU (#166056)
Support ATen operator _weight_int8pack_mm for Intel GPU (#160938)
Extend SYCL support in PyTorch CPP Extension API to allow users to implement new custom operators on Windows (#162579)
Add API torch.xpu.get_per_process_memory_fraction for Intel GPU (#165511)
Add API torch.xpu.set_per_process_memory_fraction for Intel GPU (#165510)
Add API torch.xpu.is_tf32_supported for Intel GPU (#163141)
Add API torch.xpu.can_device_access_peer for Intel GPU (#162705)
Add API torch.accelerator.get_memory_info for Intel GPU (#162564)

Improvements

Build Frontend

Abort explicitly requested CUDA build if toolkit could not be found (#166982)
RISC-V build improvements (#166602, #167071, #165717)
Allow building with arbitrary BLAS library (#166333)
Allow building with LeakSanitizer (#158686)

Composability

If you are using torch.compile(backend="aot_eager"), it should now give results that are bitwise equivalent to eager. Previously it sometimes did not, because extra compile-only decompositions were run (#165910)
Some dynamic shape errors were changed to recommend using torch._check over torch._check_is_size (#164889)
Some unbacked (dynamic shape) improvements (#162652, #169612)
Some bugfixes for symbolic float handling in compile (#166573, #162788)

C++ Frontend

Changed TORCH_CHECK_{COND} behavior to be non-fatal (#167004)
Migrated TypeTraits, TypeList, Metaprogramming, DeviceType, MemoryFormat, Layout, version.h, and CppTypeToScalarType to torch::headeronly (#167386, #163999, #168034, #165153, #164381, #167610)
Bumped libfmt submodule version to 12.0.0 (#163441)

CUDA

Make torch.cuda.rng_set_state and torch.cuda.rng_get_state work in CUDA graph capture. (#162505)
Enable templated kernels (#162875)
Enable pre-compiled kernels (#162972)
Add CUDA headers automatically (#162634)
Remove outdated header_code argument (#163165)
Prevent copies of std::vector in CUDA ForeachOps (#163416)
Implement cuda-python CUDA stream protocol (#163614)
Remove outdated checks and docs for cuBLAS determinism (#161749)
Cleanup old workaround code in ()

Distributed

c10d
- Added handling of discontiguous allgather/reducescatter inputs (#163712)
- Supported high stream for ProcessGroupXCCL (#163049)
Context Parallel
- Introduced ContextParallel plan for parallelize_module (#162542)
- Replaced context_parallel context manager with functional APIs (#164500)
- Introduced flex_cp_forward custom op for FlexAttention CP (#163185)
- Added _templated_ring_attention to the backward compatibility stub (#166991)
- Added _LoadBalancer classes, and load-balance interface to Context Parallel APIs with process-time based Round-Robin load-balance (#161062, #163617)

Dynamo

Turn on capture_scalar_outputs and capture_dynamic_output_shape_ops when fullgraph=True (#163121, #163123)
Improved tracing for dict key hashing (#169204)
Tracing support for torch.cuda.stream (#166472)
Improved tracing of torch.autograd.Functions (#166788)
Miscellaneous smaller tracing support additions:
Extend collections.defaultdict support with *args, **kwargs and custom default_factory (#166793)
Support for bitwise xor (#166065)
Support on user-defined objects ()

Export

Improved fake tensor leakage detection in export (#163516)
Improved support for tensor subclasses (#163770)

FX

Add tensor subclass printing support in fx/graph.py (#164403)
Update Node.is_impure check if subgraph contains impure ops (#166609, #167443)
Explicitly remove call_mod_node_to_replace after inlining the submodule in const_fold._inline_module (#166871)
Add strict argument validation to Interpreter.boxed_run (#166784)
Use stable topological sort in fuse_by_partitions (#167397)

Inductor

Pruned failed compilations from Autotuning candidates (#162673)
Extend triton_mm auto-tune options for HIM shapes (#163273)
Various fixes for AOTI-FX backend
Solve for undefined symbols in dynamic input shapes (#163044)
Support symbol and dynamic scalar graph inputs and outputs (#163596)
Support unbacked symbol definitions (#163729)
Generalize FloorDiv conversion to handle more complex launch grids. (#163828)
Don't flatten constant args (#166144)
Support SymInt placeholder (#167757)
Support torch.cond ()

MPS

Add embedding_bag operator (#163012, #163931, #163281)
Continue ops migration to Metal and add complex support (#169478, #166903, #167755, #167826, #166216, #166670, #169407)

Nested Tensor (NJT)

Added NJT support for share_memory_ (#162272)

torch.nn

Support batch size 0 for flash attention in scaled_dot_product_attention (#166318)
Raise an error when using a sliced BlockMask in nn.functional.flex_attention (#164702)

ONNX

Improved graph capture logic to preserve dynamic shapes and improve conversion success rate
Cover all FX passes into backed size oblivious (#166151)
Set prefer_deferred_runtime_asserts_over_guards to True (#165820)
Various warning and error messages improvements (#162819, #163074, #166412, #166558, #166692)
Improved operator translation logic
Update weight tensor initialization in RMSNormalization (#166550)
Support enable_gqa when dropout is non-zero (#162771)

Optimizer

Make Adam, AdamW work with nonzero-dim Tensor betas (#149939)

Profiler

Expose Kineto event metadata in PyTorch Profiler events (#161624)
Add user_metadata display to memory visualizer (#165939)
Add warning for clearing profiler events at the end of each cycle (#168066)

Python Frontend

Improved torch.library and custom ops to support view functions (#164520)
Rework PyObject preservation to make it thread safe, significantly simpler and better handle some edge cases (#167564)
Remove reference cycle in torch.save to improve memory usage (#165204)
Add generator arg to rand*_like APIs (#166160)
Support negative index arguments to torch.take_along_dim (#152161)

Quantization

half and bf16 support for fused_moving_avg_obs_fake_quant (#162620, #164175)
bf16 support for fake_quantize_learnable_per_channel_affine (#165098)
bf16 support for backward of torch._fake_quantize_learnable_per_tensor_affine (#165362)
Add NVFP4 two-level scaling to scaled_mm (#165774)
Add support for fp8_input/fp8_weight/bf16_bias and bf16_output for fp8 qconv in CPU (#167611)
Make the torch.float4_e2m1fn_x2 dtype support equality comparisons (#169575)
Add copy_ support for torch.float4_e2m1fn_x2 dtype ()

Release Engineering

Add support for CUDA 13.0 in CI/CD including binary builds, inductor benchmarks, and upgrade to CUDA 13.0.2 (#162455, #162425, #163787, #164383, #164607, #163239, #165029, #168091, #169902, #163988)

ROCm

Allow custom OpenBLAS library name for CMake build (#166333)
Add gfx1150 gfx1151 to binary build targets (#164782, #164854, #164763)
hipSPARSELt support - Update cuda_to_hip_mappings.py (#167335)
New implementation of upsample_bilinear2d_backward (#164572)
Remove env var HIPBLASLT_ALLOW_TF32 from codebase, TF32 always allowed (#162998)
Enable multi-arch compilation and unit tests for AOT Inductor (#166357)
Fix miopen batchnorm changing output format (#162112)

Sparse Frontend

Add MPS support sparse_mask backward and sparse sum backward (#166260, #169240)
Add exp support for COO on CPU, CUDA and MPS (#166801)
Remove old CUDA 11 sparse code (#166048, #164531, #164199)

XPU

Support --nproc-per-node torchrun option for Intel GPU (#159474)
Support complex dtype of Aten operator Matmul for Intel GPU (#160867)
Add SYCL-TLA implementation for aten flash attention (#169101)

Bug Fixes

Autograd

Fix custom autograd Function memory leak when saving mutated view (#164407)
Fix unused gradient tracking to respect create_graph (#168295)
Fix NaN gradients in atan2_backward when both inputs are zero (#166787)
Bugfix to forward autodiff causing different datatype 2 (#165784)

Build Frontend

Fix build targets order (#169905, #169994, #164165)
Do not restrict optimization flags (#164894)
Fix linking issue for Linux-aarch64 target (#169723)

C++ Frontend

Fixed C++ extension distributed warning spew (#162764)

CPU

Fix clang-21 warnings (#166859)

CUDA

Handle python floats as double in CUDA C++ (#162626)
Use libnvrtc.so path based on CUDA version used by torch (#163642)
Fix torch.nonzero_static crash on CUDA when the input is an empty tensor (#162578)
Fix caller source location in C10_CUDA_CHECK error messages (#162808)
Fix channels-last dimension mapping in CUDA parallel_cat (#165023)
64-bit indexing on CUDA:
- Fix a large tensor indexing crash (#164049)

cuDNN

Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ due to a numerical issue (#163581)

Dataloader Frontend

Fix pin memory return type when input is a tuple (#169690)

Distributed

c10d
- Enforced P2P tensors to be dense (#163719)
- Fixed split_group bug by having the parent pg option deep copied (#167125)
- Fixed ProcessGroupNCCL coalesced profiling (#160680)
Context Parallel
- Fixed cuDNN Context Parallel LSE dimension bug (#163231)
DistributedDataParallel (DDP)
- Fixed complex datatype handling in ddp (#166863)
DistributedStateDict
- Fixed keyerror when loading parameter with unsaved optimizer state (#165228)
DTensor
- Fixed foreach_max op (#169667)

Distributed Checkpointing

Avoid multiple storage writer resets in async save (#159448)
DTensor slice dequantization with proper block alignment (#163532)
Add option to use PrefixStore to create checkpoint background process (#166560)

Dynamo

Fixed cProfile usage with torch.compile in Python 3.12+ (#170013)
Fix memory leak in tensor subclass metadata guard (#167352)

FX

Fix splitter for empty subgraph case (#161716)
Use tuples to have a deterministic ordering in shape prop. (#164851)

Inductor

Fix some edge cases (#162295)
Fix TMA transpose logic to handle 1D shapes + string differences (#163966)
Fix flex attention eager: don't round down scores to low precision (closes #163588) (#163986)
Fix a condition error in torch/_inductor/codegen/debug_utils.py (#165033)
Thread deterministic config vars to subproc compilation (#165729)
Fix identity expansion. (#165066)
Fix FP8 activation quantization for duplicate forward outputs. (#163364)
Fix decomposition issues (repeat_interleave out-of-bounds indices, divmod error, alpha/beta handling). (#165368)

Ahead-Of-Time Inductor (AOTI)

Bugfix for doing negative padding (#161639)
Fix unbounded number of substitutions when equality checks contain Max expr (#163685)
Use atomic API when trying to apply size hints to input tensor strides. (#163660)
Fix a mixed-device bug for scatter_add (#167341)
Fix a small buffer mutation issue (#169347)
Fix aot_compile typing. (#168320)

MPS

Fix empty tensor handling for median/nanmedian/mv/dot (#162846, #166561, #165237)
Fix dlpack exports/imports of sliced tensors (#169272)
Fix large tensors silent correctness for fill and cat operation (#164108, #165373, #166556, #164416)
torch.compile bugfixes (#169648)

Nested Tensor (NJT)

Fixed NJT min / max operations on integer dtypes (#162273)

torch.nn

Fix silent correctness when backpropagating to score_mod in nn.functional.flex_attention (#163677)
Fix bug in nn.Module.load_state_dict for singleton tensor (#166335)

ONNX

Native ONNX ops (torch.onnx.ops)
Fix rotary_embedding_23 implementation (#162865)
Create fake implementations for onnx ops; fix boolean mask in attention (#165780)
Fix onnx export on big endian machines (#167816)

Optimizer

Fix SWALR.state_dict and load_state_dict to serialize properly with weights_only=True (#163122)
Prevent problematic tensor aliasing in LRScheduler (#163098, #163120)
Fix LBFGS wolfe max iteration (#161488)

Profiler

Fix ProfilerState typo ('Disable' → 'Disabled') and expose PRIVATEUSE1 in ActiveProfilerType (#169166)

ROCm

Fix hardsigmoid op (#162758)
Fix GEMM carveout feature (#164303)
Disable __builtin_amdgcn_rcpf for gfx90a (#166454)
ROCm 7.0 BC-breaking preparations in JIT support (#160587, #166147)

Sparse Frontend

Fix mul(COO, COO) on MPS for hybrid COO variants (#166164)
Update torch.sparse_coo_tensor error message to include more information about input tensor properties (#161900)
Fix GradTrackingTensor sparse layout propagation (#165765)

XPU

Fix OneDNN deconvolution with output_padding on Intel GPU (#169176)
Fix conv1d precision error on Intel GPU (#162944)
Fix incorrect FLOPs counting of convolution_overrideable on Intel GPU (#166839)
Fix performance drop in AOTI on Intel GPU (#163315)

Performance

Benchmark

Add attention benchmarking numbers to pytorch operator microbenchmarks (#164155)

CPU (AArch64)

Improved aarch64 performance with optimizations for type conversions (bfloat16, FP16, bool), erf function, and autovectorization enhancements (#166049, #166262, #166306, #166330, #166594, #166641, #166739, #166880, #166958)

CUDA

Integrate NVIDIA cuSolver backend into ATen/Linalg (initial implementation for eig/eigval) (#166715)
Reduce register pressure in radix_sort_pairs to improve torch.sort performance (#167094)
Add Flash Attention 4 to sdpa (#167348)
Vectorize stores in cat for all dtypes on CUDA (#162440)
Expose pinned_reserve_segment_size_mb to speed up pinned memory allocation (#164501)
torch.topk: refactor global histogram/cumsum into a dedicated kernel to improve performance on CUDA (#164459)
Vectorize 8 elements on 16-bit data types for sum/mean to improve performance (#165055)
Switch order of blocked reduce when vectorize loads to improve performance (#165178)

cuDNN

Reenable cuDNN for 64-bit depthwise convolutions (#168364)

Distributed Checkpointing

Add timeout for checkpoint background process join (#162828)
Disable GC in process based async checkpointing (#169613)
Optimize global save-plan validation (#166820)
state dict staging fixes (#166025)

Dynamo

Faster tracing of some pytree functions (#168342)

FX

Move Node._prepend/Node._remove_from_list to C++ (#165882)
Optimize torch.fx.Node.replace_all_uses_with (#165889)

Inductor

Naive foreach autotune support (#162053)
Invert unary read and write for better fusion. (#161404)
Generate fused RMS/layer norm backward. (#165370)
Optimize cold compile time when cudagraphs-partition is enabled. (#167132)
Reduce cold compilation time caused by duplicated user-defined Triton kernels. (#168292)
Optimize identity permute in empty_permuted decomposition. (#169731)
Properly enlarge XBLOCK/set num_warps=1 for B200 inner persistent reductions. (#168335)
Improved heuristic for operator reordering for peak memory. (#161810)

Quantization

Make prepare and convert faster by caching (#162550)
Add onednn context cache for CPU qlinear to improve performance (#168150)

Release Engineering

Add operator microbenchmarks for attention, convolution, and optimizer operations to CI (#165915, #166331, #168101)
Add HuggingFace LLM benchmarks and cleanup benchmark model configurations (#156967, #164815, #164816)

ROCm

Use hipSolver instead of MAGMA for Cholesky (#163977)
Layer norm now uses __builtin_amdgcn_rcpf(x) instead of 1.f/x (#165589)
OffsetCalc Unroll Optimization (#161700)
Improve perf for elementwise broadcast with mixed dtype (#163562)
Implement float32 copy kernel (#163869)
Improve non stride-one backwards indexing for small index sets (#164409)
Adjust grid size for non-unit stride backwards indexing (#165026)
Normalization update to block size (#165941)
Deserialize loads in planar sum portion of reduce() of norm. ()

torch.func

20x less memory use and 37.25% speedup in min_cut_rematerialization_partition when using the new dp knapsack solver, compared to existing default one (dp) (#160914)

Documentation

Autograd

Add inference_mode hint message to use eval with inference. (#163619)

CUDA

Add Documentation for Device APIs (#162834)
Adding aliases for CUDA and XPU API documentation (#162984)
Clarify safety of CUDA graph memory pool sharing across graphs in documentation (#166975)

Distributed

c10d
- Completed documentation for all distributed c10d APIs (#165194)

Dynamo

Updated documentation for tlparse (#171339). tlparse is a compilation report tool that processes TORCH_TRACE logs to generate interactive HTML reports showing how your model was compiled. When reporting bugs to PyTorch developers, we encourage you to attach the trace log or tlparse output to provide critical debugging information to help us bisect the issue.
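
As a sketch of how trace logs might be collected for such a report, the helper below runs a script with the TORCH_TRACE environment variable pointing at a log directory. The function names are hypothetical, and it is assumed that TORCH_TRACE names the directory the logs are written to and that the tlparse command-line tool is installed separately.

```python
import os
import subprocess
import sys


def trace_env(trace_dir):
    """Return a copy of the current environment with TORCH_TRACE set."""
    env = dict(os.environ)
    env["TORCH_TRACE"] = trace_dir
    return env


def run_with_trace(script_path, trace_dir):
    """Run a script in a subprocess so that it writes trace logs to trace_dir."""
    os.makedirs(trace_dir, exist_ok=True)
    return subprocess.run(
        [sys.executable, script_path], env=trace_env(trace_dir), check=True
    )
```

After the script finishes, running the tlparse command on the same directory should generate the HTML report that can be attached to a bug report.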

FX

Add docs for torch.fx.experimental.unification (#167334)
Fix the split_module tutorial code (#166154)

Inductor

Updated documentation for tlparse (#171339) (#162975). tlparse is a compilation report tool that processes TORCH_TRACE logs to generate interactive HTML reports showing how your model was compiled. When reporting bugs to PyTorch developers, we encourage you to attach the trace log or tlparse output to provide critical debugging information to help us bisect the issue.
Update FlexConfig documentation. (#162533)

Ahead-Of-Time Inductor (AOTI)

[AOTI] Update AOTInductor tutorial (#163808)

torch.nn

Update CTCLoss docs to note that float32 input is required for cuDNN (#162042)
Update LPPool docs to clarify ceil_mode padding semantics when ceil_mode=True (#163186)

ONNX

Update export docstring (#162622)
Fix incorrect attention example in ONNX exporter docstring (#167646)

Profiler

Add documentation for FunctionEvent (#167688)

Quantization

Document some quantization public apis (#165160)
Add missing method docstrings for pytorch quantization classes (#165199)

XPU

Add new supported client GPU Panther Lake in "Get Started with XPU" page (#170517)

Security

Developers

Composability

Removed guard_size_oblivious from internal code, replacing most usages with guard_if_{false|true}. Both APIs are used in framework code that gets traced through to make it more friendly to unbacked symints, but the new APIs are more intuitive (#164664, #164665, #167232)

Distributed

c10d
- Added TCPStore based debug page and fr trace analysis with py-spy support (#169095, #169144, #169147, #167871)
- Modernized c10d code base with python code older than 3.10 removed (#163613, #163456, #163440, #167173)
- Enabled FlightRecorder for torchft with dynamic dumping path and a reset API (#164752)

FX

Refactor proxy_tensor (#165266)
Fix invalid symbol definition emitted in fx_graph_runnable.py (#166529)
Add debug-level logging to Interpreter.run_node (#117351) (#166622)
Fix an unsafe indexing in fx exception handling (#169140)
Type annotations for torch/_higher_order_ops/flat_apply.py (#168933)
Add recompute tags (from AC) into GraphModule.print_readable() by default (#167735)
Apply ruff UP035 rule (#165214, #163744)
Add model code stack trace to cuda.memory._snapshot (#166676)

Inductor

Add API for scheduling overlap from inductor configs. (#169693)
Make LOCK_TIMEOUT in codecache configurable. (#165030)
Add debug output for specific pattern matching. (#169603)
Add overridable env var for disabling FX graph cache. (#166138)
Add subsystem support to pattern matcher. (#163922)
Add pre-grad graph bisecting support. (#166344)
Decouple flags for optimization and debug symbols. (#167385) (#167575)
Introduce HOP for inductor compiled regions to allow torch dispatch. (#167844)

Release Engineering

Migrate from setup.py to modern Python build tools (pip install and python -m build) (#156711, #156712)

ROCm

Add Rocm to Operator Microbenchmark CI (#164173)
Enable TD for all ROCm default and distributed config workflows (#168225)
Expand trunk.yml coverage for ROCm (#168162)
cudagraph trees ut fixes (#163592)
test_convolution.py uses miopen immediate mode (#164598)
Keep amdgpu-coerce-illegal-types flag if rocm version is less than 7.2 (#165789)
Use a ROCm version string without hash. (#166336)
Dynamo benchmarks: remove outdated flaky models and enable deterministic algorithms (#169024)

XPU

Upgrade Intel GPU software stack package to intel-deep-learning-essentials-2025.3 (#166829)

PyTorch 2.10.0 Release Notes

Highlights
Backwards Incompatible Changes
Deprecations
New Features
Improvements
Bug fixes
Performance
Documentation
Developers
Security

Highlights

Python 3.14 support for torch.compile(). Python 3.14t (freethreaded build) is experimentally supported as well.

Reduced kernel launch overhead with combo-kernels horizontal fusion in torchinductor

A new varlen_attn() op providing support for ragged and packed sequences

Efficient eigenvalue decompositions with DnXgeev

torch.compile() now respects use_deterministic_mode

DebugMode for tracking dispatched calls and debugging numerical divergence - This makes it simpler to track down subtle numerical bugs.

For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.

Backwards Incompatible Changes

Dataloader Frontend

Removed unused data_source argument from Sampler (#163134). This is a no-op, unless you have a custom sampler that uses this argument. Please update your custom sampler accordingly.
Removed deprecated imports for torch.utils.data.datapipes.iter.grouping (#163438). from torch.utils.data.datapipes.iter.grouping import SHARDING_PRIORITIES, ShardingFilterIterDataPipe is no longer supported. Please import from torch.utils.data.datapipes.iter.sharding instead.

torch.nn

Remove Nested Jagged Tensor support from nn.attention.flex_attention (#161734)

ONNX

fallback=False is now the default in torch.onnx.export (#162726)
The exporter now uses the dynamo=True option without fallback. This is the recommended way to use the ONNX exporter. To preserve 2.9 behavior, manually set fallback=True in the torch.onnx.export call.

Release Engineering

Rename pytorch-triton package to triton (#169888)

Deprecations

Distributed

DeviceMesh
- Added a warning for slicing flattened dim from root mesh and types for _get_slice_mesh_layout (#164993)

We decided to deprecate an existing behavior which goes against the PyTorch design principle (explicit over implicit) for device mesh slicing of flattened dim.

Version <2.9

import torch
from torch.distributed.device_mesh import

device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)

mesh_3d["dp", "cp"]._flatten()
mesh_3["dp_cp"]  # This comes with no warning

Version >=2.10

import torch
from torch.distributed.device_mesh import

device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)

mesh_3d["dp", "cp"]._flatten()
mesh_3["dp_cp"]  # This will come with a warning because it implicitly change the state of the original mesh. We will eventually remove this behavior in future release. User should do the bookkeeping of flattened mesh explicitly.

Ahead-Of-Time Inductor (AOTI)

Move from/to to torch::stable::detail (#164956)

JIT

torch.jit is not guaranteed to work in Python 3.14. Deprecation warnings have been added to user-facing torch.jit API (#167669).

torch.jit should be replaced with torch.compile or torch.export.

ONNX

The dynamic_axes option in torch.onnx.export is deprecated (#165769)

Users should supply the dynamic_shapes argument instead. See https://docs.pytorch.org/docs/stable/export.html#expressing-dynamism for more documentation.

Profiler

Deprecate export_memory_timeline method (#168036)

New Features

Autograd

Allow setting grad_dtype on leaf tensors (#164751)
Add Default Autograd Fallback for PrivateUse1 in PyTorch (#165315)
Add API to annotate disjoint backward for use with torch.utils.checkpoint.checkpoint (#166536)

Complex Frontend

Add ComplexTensor subclass (#167621)

Composability

Support autograd in torch.cond (#165908)

cuDNN

BFloat16 support added to cuDNN RNN (#164411)
[cuDNN][submodule] Upgrade to cuDNN frontend 1.16.1 (#170591)

Distributed

LocalTensor:
- LocalTensor is a powerful debugging and simulation tool in PyTorch's distributed tensor ecosystem. It allows you to simulate distributed tensor computations across multiple SPMD (Single Program, Multiple Data) ranks on a single process. This is incredibly valuable for: 1) debugging distributed code without spinning up multiple processes; 2) understanding DTensor behavior by inspecting per-rank tensor states; 3) testing DTensor operations with uneven sharding across ranks; 4) rapid prototyping of distributed algorithms. Note that LocalTensor is designed for debugging purposes only. It has significant overhead and is not suitable for production distributed training.
- LocalTensor is a torch.Tensor subclass that internally holds a mapping from rank IDs to local tensor shards. When you perform a PyTorch operation on a LocalTensor, the operation is applied independently to each local shard, mimicking distributed computation. Collective operations are simulated locally, without actual network communication. LocalTensorMode is the context manager that enables LocalTensor dispatch. It intercepts PyTorch operations and routes them appropriately. The @maybe_run_for_local_tensor decorator is essential for handling rank-specific logic when implementing distributed code.
- To get started with LocalTensor, users import from torch.distributed._local_tensor, initialize a fake process group, and wrap their distributed code in a LocalTensorMode context. Within this context, DTensor operations automatically produce LocalTensors.
- PRs: (#164537, #166595, #168110, #168314)

Dynamo

torch.compile now fully works in Python 3.14 (#167384)
Add option to error or disable applying side effects (#167239)
Config flag (skip_fwd_side_effects_in_bwd_under_checkpoint) to allow eager and compile activation-checkpointing divergence for side-effects (#165775)
torch._higher_order_ops.print for enabling printing without graph breaks or reordering (#167571)

FX

Added node metadata annotation API
Disable preservation of node metadata when enable=False (#164772)
Annotation should be mapped across submod (#165202)
Annotate bw nodes before eliminate dead code (#165782)
Add logging for debugging annotation (#165797)
Override metadata on regenerated node in functional mode (#166200)
Skip copying custom meta for gradient accumulation nodes; tag with is_gradient_acc=True (#167572)
Add metadata hook for all nodes created in runtime_assert pass (#169497)
Update gm.print_readable to include Annotation (#165397)

Inductor

Add experimental Pallas TorchInductor backend. (#166822)
Add Pallas TPU backend support. (#167774)
Add Flash Attention support to FlexAttention. (#161118)
Add deterministic mode for Inductor compilation. (#163589) (#165950) (#164532)
Enable custom op autotune decompositions and parameter tuning. (#164212) (#167193)
Expose torch.compiler.config.force_disable_caches as a public API. (#166699)

Ahead-Of-Time Inductor (AOTI)

Integrate AOTI as a backend. (#167338)
Add AOTI mingw cross compilation for Windows. (#163188)

MPS

MPS sparse backend is functional (#162349, #162007, #162910, #162885, #163011, #163694, #164961, #165102, #166708)

torch.nn

Add nn.functional.scaled_mm (#164142)
Add nn.functional.scaled_grouped_mm (#165154)
Add nn.attention.varlen_attn (#164502, #164504)
Add nn.functional.grouped_mm (#168298)

ONNX

A new testing module torch.onnx.testing with a testing utility assert_onnx_program (#162495)

Profiler

Add scope for RecordFunctionFast (#162661)

Quantization

Add _scaled_mm_v2 API (#164141)
Add scaled_grouped_mm_v2 and python API (#165154)
Add embedding_bag_byte_prepack_with_rowwise_min_max and embedding_bag_{2/4}bit_prepack_with_rowwise_min_max (#162924)
Add MXFP4 support for _scaled_grouped_mm_v2 via FBGEMM kernels (#166530)

Release Engineering

Enabled auto-revert on PyTorch CI (#163858, #164911, #165459)
Add PEP 517 compliant Python source distribution package to release process (#157815)
Add Pallas CI testing infrastructure with CPU and GPU test (#167143, #167428, #169687, #169494, #169802)

ROCm

Enable grouped GEMM via regular GEMM fallback (#162419)
Enable grouped GEMM via CK (#166334, #167403)
Enable ATen GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Support torch.cuda._compile_kernel (#162510)
Enhanced Windows support
load_inline (#162577)
Enable AOTriton runtime compile (#165538)
AOTriton scaled_dot_product_attention (#162330)
Add gfx1150 gfx1151 to hipblaslt-supported GEMM lists (#164744)

XPU

Support ATen operators scaled_mm and scaled_mm_v2 for Intel GPU (#166056)
Support ATen operator _weight_int8pack_mm for Intel GPU (#160938)
Extend SYCL support in PyTorch CPP Extension API to allow users to implement new custom operators on Windows (#162579)
Add API torch.xpu.get_per_process_memory_fraction for Intel GPU (#165511)
Add API torch.xpu.set_per_process_memory_fraction for Intel GPU (#165510)
Add API torch.xpu.is_tf32_supported for Intel GPU (#163141)
Add API torch.xpu.can_device_access_peer for Intel GPU (#162705)
Add API torch.accelerator.get_memory_info for Intel GPU (#162564)

Improvements

Build Frontend

Abort explicitly requested CUDA build if toolkit could not be found (#166982)
RISC-V build improvements (#166602, #167071, #165717)
Allow building with arbitrary BLAS library (#166333)
Allow building with LeakSanitizer (#158686)

Composability

If you are using torch.compile(backend="aot_eager"), it should now give results that are bitwise equivalent to eager. Previously it sometimes did not, because extra compile-only decompositions were running (#165910)
Some dynamic shape errors were changed to recommend using torch._check over torch._check_is_size (#164889)
Some unbacked (dynamic shape) improvements (#162652, #169612)
Some bugfixes for symbolic float handling in compile (#166573, #162788)
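
To check the aot_eager claim above on your own code, a comparison along these lines may help. The function f is a hypothetical stand-in, and the outcome of the check depends on the PyTorch version and the ops involved.

```python
import torch


def f(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the code you want to compare."""
    return torch.sin(x) * 2.0 + 1.0


x = torch.linspace(0.0, 1.0, steps=8)

# Compile with the aot_eager backend, which traces through AOTAutograd
# but runs the resulting graph with eager kernels.
compiled_f = torch.compile(f, backend="aot_eager")

eager_out = f(x)
compiled_out = compiled_f(x)

# torch.equal checks exact (bitwise for finite values) equality.
matches = torch.equal(eager_out, compiled_out)
```

If matches is False on 2.10, that is worth reporting together with a minimal reproducer.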

C++ Frontend

Changed TORCH_CHECK_{COND} behavior to be non-fatal (#167004)
Migrated TypeTraits, TypeList, Metaprogramming, DeviceType, MemoryFormat, Layout, version.h, and CppTypeToScalarType to torch::headeronly (#167386, #163999, #168034, #165153, #164381, #167610)
Bumped libfmt submodule version to 12.0.0 (#163441)

CUDA

Make torch.cuda.rng_set_state and torch.cuda.rng_get_state work in CUDA graph capture. (#162505)
Enable templated kernels (#162875)
Enable pre-compiled kernels (#162972)
Add CUDA headers automatically (#162634)
Remove outdated header_code argument (#163165)
Prevent copies of std::vector in CUDA ForeachOps (#163416)
Implement cuda-python CUDA stream protocol (#163614)
Remove outdated checks and docs for cuBLAS determinism (#161749)
Cleanup old workaround code in ()

Distributed

c10d
- Added handling of discontiguous allgather/reducescatter inputs (#163712)
- Supported high stream for ProcessGroupXCCL (#163049)
Context Parallel
- Introduced ContextParallel plan for parallelize_module (#162542)
- Replaced context_parallel context manager with functional APIs (#164500)
- Introduced flex_cp_forward custom op for FlexAttention CP (#163185)
- Added _templated_ring_attention to the backward compatibility stub (#166991)
- Added _LoadBalancer classes, and load-balance interface to Context Parallel APIs with process-time based Round-Robin load-balance (#161062, #163617)

Dynamo

Turn on capture_scalar_outputs and capture_dynamic_output_shape_ops when fullgraph=True (#163121, #163123)
Improved tracing for dict key hashing (#169204)
Tracing support for torch.cuda.stream (#166472)
Improved tracing of torch.autograd.Functions (#166788)
Miscellaneous smaller tracing support additions:
Extend collections.defaultdict support with *args, **kwargs and custom default_factory (#166793)
Support for bitwise xor (#166065)
Support on user-defined objects ()

Export

Improved fake tensor leakage detection in export (#163516)
Improved support for tensor subclasses (#163770)

FX

Add tensor subclass printing support in fx/graph.py (#164403)
Update Node.is_impure check if subgraph contains impure ops (#166609, #167443)
Explicitly remove call_mod_node_to_replace after inlining the submodule in const_fold._inline_module (#166871)
Add strict argument validation to Interpreter.boxed_run (#166784)
Use stable topological sort in fuse_by_partitions (#167397)

Inductor

Pruned failed compilations from Autotuning candidates (#162673)
Extend triton_mm auto-tune options for HIM shapes (#163273)
Various fixes for AOTI-FX backend
Solve for undefined symbols in dynamic input shapes (#163044)
Support symbol and dynamic scalar graph inputs and outputs (#163596)
Support unbacked symbol definitions (#163729)
Generalize FloorDiv conversion to handle more complex launch grids. (#163828)
Don't flatten constant args (#166144)
Support SymInt placeholder (#167757)
Support torch.cond ()

MPS

Add embedding_bag operator (#163012, #163931, #163281)
Continue ops migration to Metal and add complex support (#169478, #166903, #167755, #167826, #166216, #166670, #169407)

Nested Tensor (NJT)

Added NJT support for share_memory_ (#162272)

torch.nn

Support batch size 0 for flash attention in scaled_dot_product_attention (#166318)
Raise an error when using a sliced BlockMask in nn.functional.flex_attention (#164702)

ONNX

Improved graph capture logic to preserve dynamic shapes and improve conversion success rate
Cover all FX passes into backed size oblivious (#166151)
Set prefer_deferred_runtime_asserts_over_guards to True (#165820)
Various warning and error messages improvements (#162819, #163074, #166412, #166558, #166692)
Improved operator translation logic
Update weight tensor initialization in RMSNormalization (#166550)
Support enable_gqa when dropout is non-zero (#162771)

Optimizer

Make Adam, AdamW work with nonzero-dim Tensor betas (#149939)

Profiler

Expose Kineto event metadata in PyTorch Profiler events (#161624)
Add user_metadata display to memory visualizer (#165939)
Add warning for clearing profiler events at the end of each cycle (#168066)

Python Frontend

Improved torch.library and custom ops to support view functions (#164520)
Rework PyObject preservation to make it thread safe, significantly simpler and better handle some edge cases (#167564)
Remove reference cycle in torch.save to improve memory usage (#165204)
Add generator arg to rand*_like APIs (#166160)
Support negative index arguments to torch.take_along_dim (#152161)

Quantization

half and bf16 support for fused_moving_avg_obs_fake_quant (#162620, #164175)
bf16 support for fake_quantize_learnable_per_channel_affine (#165098)
bf16 support for backward of torch._fake_quantize_learnable_per_tensor_affine (#165362)
Add NVFP4 two-level scaling to scaled_mm (#165774)
Add support for fp8_input/fp8_weight/bf16_bias and bf16_output for fp8 qconv in CPU (#167611)
Make the torch.float4_e2m1fn_x2 dtype support equality comparisons (#169575)
Add copy_ support for torch.float4_e2m1fn_x2 dtype ()

Release Engineering

Add support for CUDA 13.0 in CI/CD including binary builds, inductor benchmarks, and upgrade to CUDA 13.0.2 (#162455, #162425, #163787, #164383, #164607, #163239, #165029, #168091, #169902, #163988)

ROCm

Allow custom OpenBLAS library name for CMake build (#166333)
Add gfx1150 gfx1151 to binary build targets (#164782, #164854, #164763)
hipSPARSELt support - Update cuda_to_hip_mappings.py (#167335)
New implementation of upsample_bilinear2d_backward (#164572)
Remove env var HIPBLASLT_ALLOW_TF32 from codebase, TF32 always allowed (#162998)
Enable multi-arch compilation and unit tests for AOT Inductor (#166357)
Fix miopen batchnorm changing output format (#162112)

Sparse Frontend

Add MPS support sparse_mask backward and sparse sum backward (#166260, #169240)
Add exp support for COO on CPU, CUDA and MPS (#166801)
Remove old CUDA 11 sparse code (#166048, #164531, #164199)

XPU

Support --nproc-per-node torchrun option for Intel GPU (#159474)
Support complex dtype of Aten operator Matmul for Intel GPU (#160867)
Add SYCL-TLA implementation for aten flash attention (#169101)

Bug Fixes

Autograd

Fix custom autograd Function memory leak when saving mutated view (#164407)
Fix unused gradient tracking to respect create_graph (#168295)
Fix NaN gradients in atan2_backward when both inputs are zero (#166787)
Bugfix to forward autodiff causing different datatype 2 (#165784)

Build Frontend

Fix build targets order (#169905, #169994, #164165)
Do not restrict optimization flags (#164894)
Fix linking issue for Linux-aarch64 target (#169723)

C++ Frontend

Fixed C++ extension distributed warning spew (#162764)

CPU

Fix clang-21 warnings (#166859)

CUDA

Handle python floats as double in CUDA C++ (#162626)
Use libnvrtc.so path based on CUDA version used by torch (#163642)
Fix torch.nonzero_static crash on CUDA when the input is an empty tensor (#162578)
Fix caller source location in C10_CUDA_CHECK error messages (#162808)
Fix channels-last dimension mapping in CUDA parallel_cat (#165023)
64-bit indexing on CUDA:
- Fix a large tensor indexing crash (#164049)

cuDNN

Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ due to a numerical issue (#163581)

Dataloader Frontend

Fix pin memory return type when input is a tuple (#169690)

Distributed

c10d
- Enforced P2P tensors to be dense (#163719)
- Fixed split_group bug by having the parent pg option deep copied (#167125)
- Fixed ProcessGroupNCCL coalesced profiling (#160680)
Context Parallel
- Fixed cuDNN Context Parallel LSE dimension bug (#163231)
DistributedDataParallel (DDP)
- Fixed complex datatype handling in ddp (#166863)
DistributedStateDict
- Fixed KeyError when loading a parameter with unsaved optimizer state (#165228)
DTensor
- Fixed foreach_max op (#169667)

Distributed Checkpointing

Avoid multiple storage writer resets in async save (#159448)
DTensor slice dequantization with proper block alignment (#163532)
Add option to use PrefixStore to create checkpoint background process (#166560)

Dynamo

Fixed cProfile usage with torch.compile in Python 3.12+ (#170013)
Fix memory leak in tensor subclass metadata guard (#167352)

FX

Fix splitter for empty subgraph case (#161716)
Use tuples to have a deterministic ordering in shape prop. (#164851)

Inductor

Fix some edge cases (#162295)
Fix TMA transpose logic to handle 1D shapes + string differences (#163966)
Fix flex attention eager: don't round down scores to low precision (closes #163588) (#163986)
Fix a condition error in torch/_inductor/codegen/debug_utils.py (#165033)
Thread deterministic config vars to subproc compilation (#165729)
Fix identity expansion. (#165066)
Fix FP8 activation quantization for duplicate forward outputs. (#163364)
Fix decomposition issues (repeat_interleave out-of-bounds indices, divmod error, alpha/beta handling). (#165368)

Ahead-Of-Time Inductor (AOTI)

Bugfix for doing negative padding (#161639)
Fix unbounded number of substitutions when equality checks contain Max expr (#163685)
Use atomic API when trying to apply size hints to input tensor strides. (#163660)
Fix a mixed-device bug for scatter_add (#167341)
Fix a small buffer mutation issue (#169347)
Fix aot_compile typing. (#168320)

MPS

Fix empty tensor handling for median/nanmedian/mv/dot (#162846, #166561, #165237)
Fix dlpack exports/imports of sliced tensors (#169272)
Fix large tensors silent correctness for fill and cat operation (#164108, #165373, #166556, #164416)
torch.compile bugfixes (#169648)

Nested Tensor (NJT)

Fixed NJT min / max operations on integer dtypes (#162273)

torch.nn

Fix silent correctness when backpropagating to score_mod in nn.functional.flex_attention (#163677)
Fix bug in nn.Module.load_state_dict for singleton tensor (#166335)

ONNX

Native ONNX ops (torch.onnx.ops)
Fix rotary_embedding_23 implementation (#162865)
Create fake implementations for onnx ops; fix boolean mask in attention (#165780)
Fix onnx export on big endian machines (#167816)

Optimizer

Fix SWALR.state_dict and load_state_dict to serialize properly with weights_only=True (#163122)
Prevent problematic tensor aliasing in LRScheduler (#163098, #163120)
Fix LBFGS wolfe max iteration (#161488)

Profiler

Fix ProfilerState typo ('Disable' → 'Disabled') and expose PRIVATEUSE1 in ActiveProfilerType (#169166)

ROCm

Fix hardsigmoid op (#162758)
Fix GEMM carveout feature (#164303)
Disable __builtin_amdgcn_rcpf for gfx90a (#166454)
ROCm 7.0 BC-breaking preparations in JIT support (#160587, #166147)

Sparse Frontend

Fix mul(COO, COO) on MPS for hybrid COO variants (#166164)
Update torch.sparse_coo_tensor error message to include more information about input tensor properties (#161900)
Fix GradTrackingTensor sparse layout propagation (#165765)

XPU

Fix OneDNN deconvolution with output_padding on Intel GPU (#169176)
Fix conv1d precision error on Intel GPU (#162944)
Fix incorrect FLOPs counting of convolution_overrideable on Intel GPU (#166839)
Fix performance drop in AOTI on Intel GPU (#163315)

Performance

Benchmark

Add attention benchmarking numbers to pytorch operator microbenchmarks (#164155)

CPU (AArch64)

Improved aarch64 performance with optimizations for type conversions (bfloat16, FP16, bool), erf function, and autovectorization enhancements (#166049, #166262, #166306, #166330, #166594, #166641, #166739, #166880, #166958)

CUDA

Integrate NVIDIA cuSolver backend into ATen/Linalg (initial implementation for eig/eigval) (#166715)
Reduce register pressure in radix_sort_pairs to improve torch.sort performance (#167094)
Add Flash Attention 4 to sdpa (#167348)
Vectorize stores in cat for all dtypes on CUDA (#162440)
Expose pinned_reserve_segment_size_mb to speed up pinned memory allocation (#164501)
torch.topk: refactor global histogram/cumsum into a dedicated kernel to improve performance on CUDA (#164459)
Vectorize 8 elements on 16-bit data types for sum/mean to improve performance (#165055)
Switch order of blocked reduce when vectorize loads to improve performance (#165178)

cuDNN

Reenable cuDNN for 64-bit depthwise convolutions (#168364)

Distributed Checkpointing

Add timeout for checkpoint background process join (#162828)
Disable GC in process based async checkpointing (#169613)
Optimize global save-plan validation (#166820)
state dict staging fixes (#166025)

Dynamo

Faster tracing of some pytree functions (#168342)

FX

Move Node._prepend/Node._remove_from_list to C++ (#165882)
Optimize torch.fx.Node.replace_all_uses_with (#165889)

Inductor

Naive foreach autotune support (#162053)
Invert unary read and write for better fusion. (#161404)
Generate fused RMS/layer norm backward. (#165370)
Optimize cold compile time when cudagraphs-partition is enabled. (#167132)
Reduce cold compilation time caused by duplicated user-defined Triton kernels. (#168292)
Optimize identity permute in empty_permuted decomposition. (#169731)
Properly enlarge XBLOCK/set num_warps=1 for B200 inner persistent reductions. (#168335)
Improved heuristic for operator reordering for peak memory. (#161810)

Quantization

Make prepare and convert faster by caching (#162550)
Add onednn context cache for CPU qlinear to improve performance (#168150)

Release Engineering

Add operator microbenchmarks for attention, convolution, and optimizer operations to CI (#165915, #166331, #168101)
Add HuggingFace LLM benchmarks and cleanup benchmark model configurations (#156967, #164815, #164816)

ROCm

Use hipSolver instead of MAGMA for Cholesky (#163977)
Layer norm now uses __builtin_amdgcn_rcpf(x) instead of 1.f/x (#165589)
OffsetCalc Unroll Optimization (#161700)
Improve perf for elementwise broadcast with mixed dtype (#163562)
Implement float32 copy kernel (#163869)
Improve non stride-one backwards indexing for small index sets (#164409)
Adjust grid size for non-unit stride backwards indexing (#165026)
Normalization update to block size (#165941)
Deserialize loads in planer sum portion of reduce() of norm. ()

torch.func

20x less memory use and 37.25% speedup in min_cut_rematerialization_partition when using the new dp knapsack solver, compared to existing default one (dp) (#160914)

Documentation

Autograd

Add inference_mode hint message to use eval with inference. (#163619)

CUDA

Add Documentation for Device APIs (#162834)
Adding aliases for CUDA and XPU API documentation (#162984)
Clarify safety of CUDA graph memory pool sharing across graphs in documentation (#166975)

Distributed

c10d
- Completed documentation for all distributed c10d APIs (#165194)

Dynamo

Updated documentation for tlparse (#171339). tlparse is a compilation report tool that processes TORCH_TRACE logs to generate interactive HTML reports showing how your model was compiled. When reporting bugs to PyTorch developers, we encourage you to attach the trace log or tlparse output to provide critical debugging information to help us bisect the issue.

FX

Add docs for torch.fx.experimental.unification (#167334)
Fix the split_module tutorial code (#166154)

Inductor

Updated documentation for tlparse (#171339) (#162975). tlparse is a compilation report tool that processes TORCH_TRACE logs to generate interactive HTML reports showing how your model was compiled. When reporting bugs to PyTorch developers, we encourage you to attach the trace log or tlparse output to provide critical debugging information to help us bisect the issue.
Update FlexConfig documentation. (#162533)

Ahead-Of-Time Inductor (AOTI)

[AOTI] Update AOTInductor tutorial (#163808)

torch.nn

Update CTCLoss docs to note that float32 input is required for cuDNN (#162042)
Update LPPool docs to clarify ceil_mode padding semantics when ceil_mode=True (#163186)

ONNX

Update export docstring (#162622)
Fix incorrect attention example in ONNX exporter docstring (#167646)

Profiler

Add documentation for FunctionEvent (#167688)

Quantization

Document some quantization public apis (#165160)
Add missing method docstrings for pytorch quantization classes (#165199)

XPU

Add new supported client GPU Panther Lake in "Get Started with XPU" page (#170517)

Security

Developers

Composability

Removed guard_size_oblivious from internal code, replacing most usages with guard_if_{false|true}. Both APIs are used in framework code that gets traced through to make it more friendly to unbacked symints, but the new APIs are more intuitive (#164664, #164665, #167232)

Distributed

c10d
- Added TCPStore based debug page and fr trace analysis with py-spy support (#169095, #169144, #169147, #167871)
- Modernized c10d code base with python code older than 3.10 removed (#163613, #163456, #163440, #167173)
- Enabled FlightRecorder for torchft with dynamic dumping path and a reset API (#164752)

FX

Refactor proxy_tensor (#165266)
Fix invalid symbol definition emitted in fx_graph_runnable.py (#166529)
Add debug-level logging to Interpreter.run_node (#117351) (#166622)
Fix an unsafe indexing in fx exception handling (#169140)
Type annotations for torch/_higher_order_ops/flat_apply.py (#168933)
Add recompute tags (from AC) into GraphModule.print_readable() by default (#167735)
Apply ruff UP035 rule (#165214, #163744)
Add model code stack trace to cuda.memory._snapshot (#166676)

Inductor

Add API for scheduling overlap from inductor configs. (#169693)
Make LOCK_TIMEOUT in codecache configurable. (#165030)
Add debug output for specific pattern matching. (#169603)
Add overridable env var for disabling FX graph cache. (#166138)
Add subsystem support to pattern matcher. (#163922)
Add pre-grad graph bisecting support. (#166344)
Decouple flags for optimization and debug symbols. (#167385) (#167575)
Introduce HOP for inductor compiled regions to allow torch dispatch. (#167844)

Release Engineering

Migrate from setup.py to modern Python build tools (pip install and python -m build) (#156711, #156712)

ROCm

Add Rocm to Operator Microbenchmark CI (#164173)
Enable TD for all ROCm default and distributed config workflows (#168225)
Expand trunk.yml coverage for ROCm (#168162)
cudagraph trees ut fixes (#163592)
test_convolution.py uses miopen immediate mode (#164598)
Keep amdgpu-coerce-illegal-types flag if rocm version is less than 7.2 (#165789)
Use a ROCm version string without hash. (#166336)
Dynamo benchmarks: remove outdated flaky models and enable deterministic algorithms (#169024)

XPU

Upgrade Intel GPU software stack package to intel-deep-learning-essentials-2025.3 (#166829)