ROCm 7.2.0 Release
The release notes provide a summary of notable changes since the previous ROCm release.
If you’re using AMD Radeon GPUs or Ryzen APUs in a workstation setting with a display connected, see the [Use ROCm on Radeon and Ryzen](https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html)
documentation to verify compatibility and system requirements.
The following are notable new features and improvements in ROCm 7.2.0. For changes to individual components, see Detailed component changes.
ROCm 7.2.0 adds support for RDNA4 architecture-based AMD Radeon AI PRO R9600D and AMD Radeon RX 9060 XT LP, and RDNA3 architecture-based AMD Radeon RX 7700 GPUs.
ROCm 7.2.0 extends the SLES 15 SP7 operating system support to AMD Instinct MI355X and MI350X GPUs.
For more information about:
- AMD hardware, see Supported GPUs (Linux).
- Operating systems, see Supported operating systems and ROCm installation for Linux.
Virtualization support remains unchanged in this release. For more information, see Virtualization support.
The software for AMD Data Center GPU products requires maintaining a hardware and software stack with interdependencies among the GPU and baseboard firmware, AMD GPU drivers, and the ROCm user space software. While AMD publishes drivers and ROCm user space components, your server or infrastructure provider publishes the GPU and baseboard firmware by bundling AMD’s firmware releases via AMD’s Platform Level Data Model (PLDM) bundle, which includes the Integrated Firmware Image (IFWI).
GPU and baseboard firmware versioning might differ across GPU families.
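| ROCm Version | GPU | PLDM Bundle (Firmware) | AMD GPU Driver (amdgpu) | AMD GPU |
|---|---|---|---|---|
| ROCm 7.2.0 | MI355X | 01.25.17.07, 01.25.16.03 | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10 | 8.7.0.K |
| | MI350X | 01.25.17.07, 01.25.16.03 | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10 | 8.7.0.K |
| | MI325X[1] | 01.25.04.02 | 30.30.0, 30.20.1, 30.20.0[1], 30.10.2, 30.10.1, 30.10, 6.4.z where z (0–3), 6.3.y where y (2–3) | 8.7.0.K |
| | MI300X[2] | 01.25.03.12 | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10, 6.4.z where z (0–3), 6.3.y where y (2–3) | 8.7.0.K |
| | MI300A | BKC 26 | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10, 6.4.z where z (0–3), 6.3.y where y (2–3) | Not Applicable |
| | MI250X | IFWI 47 (or later) | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10, 6.4.z where z (0–3), 6.3.y where y (2–3) | Not Applicable |
| | MI250 | MU5 w/ IFWI 75 (or later) | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10, 6.4.z where z (0–3), 6.3.y where y (2–3) | Not Applicable |
| | MI210 | MU5 w/ IFWI 75 (or later) | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10, 6.4.z where z (0–3), 6.3.y where y (2–3) | 8.7.0.K |
| | MI100 | VBIOS D3430401-037 | 30.30.0, 30.20.1, 30.20.0, 30.10.2, 30.10.1, 30.10, 6.4.z where z (0–3), 6.3.y where y (2–3) | Not Applicable |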
[1]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU driver (amdgpu) 30.20.0.
[2]: AMD Instinct MI300X KVM SR-IOV with Multi-VF (8 VF) support requires a compatible firmware BKC bundle for the GPU, which will be released in the coming months.
Node Power Management (NPM) optimizes power allocation and GPU frequency across multiple GPUs within a node using built-in telemetry and advanced control algorithms. It dynamically scales GPU frequencies to keep total node power within limits. Use AMD SMI to verify whether NPM is enabled and to check the node’s power allocation. This feature is supported on AMD Instinct MI355X and MI350X GPUs in both bare-metal and KVM SR-IOV virtual environments when paired with PLDM bundle 01.25.17.07. See the AMD SMI changelog for details.
The following models have been optimized for AMD Instinct MI350 Series GPUs:
The following models have been optimized for AMD Instinct MI300X GPUs:
HIP runtime now implements an optimized doorbell ring mechanism for certain graph execution topologies. It enables efficient batching of graph nodes. This enhancement provides better alignment with NVIDIA CUDA Graph optimizations.
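To make the batching idea concrete, here is a minimal host-side model, assuming an HSA-style user-mode queue in which the host writes AQL packets into a ring buffer and then writes a doorbell to publish them to the GPU. It is not HIP runtime code: `ModelQueue`, `Packet`, `submit_unbatched`, and `submit_batched` are hypothetical names, and the model omits packet headers and the device-side read index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Conceptual model only, not HIP runtime code. The host writes packets into a
// ring buffer, then rings the doorbell to tell the device how far the write
// index has advanced. All names here are hypothetical.
struct Packet {
    std::uint32_t node_id;  // stand-in for an AQL packet describing one graph node
};

class ModelQueue {
public:
    explicit ModelQueue(std::size_t capacity) : ring_(capacity) {}

    // Writes one packet without notifying the device.
    void write(const Packet& p) {
        // Simplified capacity check: unpublished packets must fit in the ring.
        assert(write_index_ - doorbell_value_ < ring_.size());
        ring_[write_index_ % ring_.size()] = p;
        ++write_index_;
    }

    // Publishes every packet written so far with a single doorbell write.
    void ring_doorbell() {
        doorbell_value_ = write_index_;
        ++doorbell_rings_;
    }

    std::uint64_t doorbell_rings() const { return doorbell_rings_; }
    std::uint64_t doorbell_value() const { return doorbell_value_; }

private:
    std::vector<Packet> ring_;
    std::uint64_t write_index_ = 0;
    std::uint64_t doorbell_value_ = 0;
    std::uint64_t doorbell_rings_ = 0;
};

// Unbatched submission: one doorbell write per graph node.
inline void submit_unbatched(ModelQueue& q, std::uint32_t num_nodes) {
    for (std::uint32_t i = 0; i < num_nodes; ++i) {
        q.write(Packet{i});
        q.ring_doorbell();
    }
}

// Batched submission: write all packets first, then ring the doorbell once.
inline void submit_batched(ModelQueue& q, std::uint32_t num_nodes) {
    for (std::uint32_t i = 0; i < num_nodes; ++i) {
        q.write(Packet{i});
    }
    q.ring_doorbell();
}
```

Both paths publish the same packets and leave the doorbell at the same final write index, but the batched path performs one doorbell write instead of one per node, which is where a reduction in host-side submission overhead would come from.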
HIP also adds a new performance test for HIP graphs with programmable topologies to measure graph performance across different structures. The test evaluates graph instantiation time, first-launch time, repeat launch times, and end-to-end execution for various graph topologies. The test implements comprehensive timing measurements, including CPU overhead and device execution time.
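The distinction between instantiation, first-launch, and repeat-launch times can be sketched with a small host-side timing helper. This is a hypothetical illustration, not the HIP performance test itself: `GraphTimings`, `elapsed_us`, and `time_graph` are made-up names, and the callables stand in for real HIP calls. In a HIP program, `instantiate` could wrap hipGraphInstantiate, and `launch` could wrap hipGraphLaunch followed by hipStreamSynchronize so that device execution is included.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical timing helper. In a HIP program, `instantiate` could wrap
// hipGraphInstantiate, and `launch` could wrap hipGraphLaunch followed by
// hipStreamSynchronize so that device execution is included in the time.
struct GraphTimings {
    double instantiate_us = 0.0;
    double first_launch_us = 0.0;
    std::vector<double> repeat_launch_us;
};

// Wall-clock duration of one call, in microseconds.
inline double elapsed_us(const std::function<void()>& fn) {
    assert(fn && "callable must not be empty");
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

inline GraphTimings time_graph(const std::function<void()>& instantiate,
                               const std::function<void()>& launch,
                               std::size_t repeats) {
    GraphTimings t;
    t.instantiate_us = elapsed_us(instantiate);
    // Time the first launch separately, since it can include one-time setup work.
    t.first_launch_us = elapsed_us(launch);
    for (std::size_t i = 0; i < repeats; ++i) {
        t.repeat_launch_us.push_back(elapsed_us(launch));
    }
    return t;
}
```

Keeping the first launch separate from the repeated launches avoids mixing one-time costs into the steady-state numbers.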
HIP runtime now implements a back memory set (memset) optimization that improves how memset nodes are processed during graph execution. It specifically handles the varying number of AQL (Architected Queuing Language) packets that a memset graph node can require when its parameters are updated through the graph node set-params APIs, as part of the AQL batch submission approach.
HIP runtime has removed lock contention in the async handler enqueue path. This reduces runtime overhead and maximizes GPU throughput for asynchronous kernel execution, especially in multi-threaded applications.
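The release notes don't describe how the contention was removed. As a generic illustration only, the sketch below shows one common way to keep an enqueue path free of lock contention: producers push onto an intrusive list with a compare-and-swap (a Treiber-style push) instead of taking a mutex, and a single consumer detaches all pending items with one atomic exchange. `LockFreeInbox` and `count_after_concurrent_push` are hypothetical names, not HIP runtime internals.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Generic illustration, not HIP runtime code: producers enqueue with a
// compare-and-swap instead of a mutex, and a single consumer detaches all
// pending items with one atomic exchange.
template <typename T>
class LockFreeInbox {
public:
    LockFreeInbox() = default;
    LockFreeInbox(const LockFreeInbox&) = delete;
    LockFreeInbox& operator=(const LockFreeInbox&) = delete;
    ~LockFreeInbox() { free_list(head_.exchange(nullptr)); }

    // Safe to call concurrently from any number of producer threads.
    void push(T value) {
        Node* node = new Node{std::move(value), head_.load(std::memory_order_relaxed)};
        // On failure, compare_exchange_weak reloads node->next with the current head.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Single consumer only: takes every pending item (newest first) at once.
    std::vector<T> drain() {
        Node* list = head_.exchange(nullptr, std::memory_order_acquire);
        std::vector<T> items;
        for (Node* n = list; n != nullptr; n = n->next) {
            items.push_back(std::move(n->value));
        }
        free_list(list);
        return items;
    }

private:
    struct Node {
        T value;
        Node* next;
    };

    static void free_list(Node* n) {
        while (n != nullptr) {
            Node* next = n->next;
            delete n;
            n = next;
        }
    }

    std::atomic<Node*> head_{nullptr};
};

// Demo helper: `producers` threads each push `per_producer` items concurrently.
inline std::size_t count_after_concurrent_push(int producers, int per_producer) {
    assert(producers >= 0 && per_producer >= 0);
    LockFreeInbox<int> inbox;
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&inbox, p, per_producer] {
            for (int i = 0; i < per_producer; ++i) {
                inbox.push(p * per_producer + i);
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return inbox.drain().size();
}
```

Items come out newest first; a consumer that needs submission order can reverse the drained vector. Because producers never block on each other, adding threads does not serialize them behind a lock.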
To simplify cross-platform programming and improve code portability between AMD ROCm and other programming models, new HIP APIs have been added in ROCm 7.2.0.
The following new HIP library management APIs have been added:
- hipLibraryGetKernel gets a kernel from a library.
- hipLibraryGetKernelCount gets the kernel count in a library.
- hipLibraryLoadData creates a library object from code.
- hipLibraryLoadFromFile creates a library object from a file.
- hipLibraryUnload unloads a library.
- hipKernelGetName returns the function name for a hipKernel_t handle.
- hipKernelGetLibrary returns the library handle for a hipKernel_t handle.
- hipLibraryEnumerateKernels returns the kernel handles within a library.

The hipOccupancyAvailableDynamicSMemPerBlock API has been added to return the dynamic shared memory available per block when launching numBlocks blocks on a CU.
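As a rough mental model of the quantity hipOccupancyAvailableDynamicSMemPerBlock reports, the sketch below splits a compute unit's LDS (shared memory) evenly across numBlocks resident blocks, rounds down to an allocation granularity, and subtracts the kernel's static shared memory. This is a simplified, assumption-laden model, not the HIP implementation or its signature: `LdsModel` and `available_dynamic_lds_per_block` are hypothetical, the LDS size and granularity are caller-supplied assumptions, and register and wavefront limits are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Simplified, hypothetical model; not the HIP implementation. The LDS size per
// CU, allocation granularity, and per-block limit are caller-supplied
// assumptions, and register or wavefront limits are ignored.
struct LdsModel {
    std::size_t lds_bytes_per_cu;     // total LDS (shared memory) per compute unit
    std::size_t alloc_granularity;    // LDS allocation granularity in bytes
    std::size_t max_bytes_per_block;  // per-block LDS limit
};

// Bytes of dynamic shared memory each block can request so that `num_blocks`
// blocks still fit on one CU at the same time.
inline std::size_t available_dynamic_lds_per_block(const LdsModel& m,
                                                   std::size_t num_blocks,
                                                   std::size_t static_lds_bytes) {
    assert(num_blocks > 0 && m.alloc_granularity > 0);
    // Split the CU's LDS evenly, rounded down to the allocation granularity.
    std::size_t per_block =
        (m.lds_bytes_per_cu / num_blocks) / m.alloc_granularity * m.alloc_granularity;
    per_block = std::min(per_block, m.max_bytes_per_block);
    // Static shared memory declared in the kernel is carved out first.
    return per_block > static_lds_bytes ? per_block - static_lds_bytes : 0;
}
```

For example, with an assumed 64 KiB of LDS per CU and 512-byte granularity, four resident blocks leave 16 KiB per block, so a kernel that already uses 1 KiB of static shared memory could request up to 15 KiB dynamically.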
The new stream management API hipStreamCopyAttributes has been implemented to improve CUDA parity.
The rocSHMEM communications library has added the GDA (GPUDirect Async) intra-node and inter-node communication backend conduit. This new backend enables communication between GPUs within a node or between nodes through an RNIC (RDMA NIC) using device-initiated GPU kernels to communicate with other GPUs. The GPU directly interacts with the RNIC with no host (CPU) involvement in the critical path of communication.
In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see Installing rocSHMEM.
hipTensor has implemented a software-managed plan cache. The main features of the plan cache include:
hiptensorHandle_t.
hipTensor has also been enhanced with:
C++17 to C++20.
hipCUB, rocRAND, and rocThrust support building with target-agnostic Standard Portable Intermediate Representation-V (SPIR-V). This support is currently in an early access state.
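The release notes don't spell out hipTensor's cache policy. As a hypothetical sketch of the general idea behind the software-managed plan cache mentioned above, the following class reuses previously built plans keyed by a problem description and evicts the least recently used plan when full. `LruPlanCache`, its string key format, and the LRU policy are assumptions for illustration; hipTensor's own entry points (such as hiptensorHandleWritePlanCacheToFile and hiptensorHandleReadPlanCacheFromFile, listed in the detailed component changes) are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical illustration of a software-managed plan cache, not the
// hipTensor API. The key is a string that describes a problem (for example,
// tensor extents, strides, and data types); Plan stands in for the
// expensive-to-build object being reused. Eviction is least recently used (LRU).
template <typename Plan>
class LruPlanCache {
public:
    explicit LruPlanCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached plan, if any, and marks it as most recently used.
    std::optional<Plan> find(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) {
            return std::nullopt;
        }
        entries_.splice(entries_.begin(), entries_, it->second);
        return it->second->second;
    }

    // Inserts or replaces a plan, evicting the least recently used entry when full.
    void insert(const std::string& key, Plan plan) {
        if (capacity_ == 0) {
            return;
        }
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(plan);
            entries_.splice(entries_.begin(), entries_, it->second);
            return;
        }
        if (entries_.size() == capacity_) {
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(key, std::move(plan));
        index_[key] = entries_.begin();
        assert(index_.size() == entries_.size());
    }

    std::size_t size() const { return entries_.size(); }

private:
    using Entry = std::pair<std::string, Plan>;
    std::size_t capacity_;
    std::list<Entry> entries_;  // front = most recently used
    std::unordered_map<std::string, typename std::list<Entry>::iterator> index_;
};
```

A lookup that hits the cache skips plan construction entirely, which is the benefit a plan cache targets when the same contraction shape is executed repeatedly.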
hipBLASLt has the following enhancements:
rocWMMA has the following enhancements:
A perf_i8gemm sample has been added to demonstrate int8_t as the matrix input data type.
MIGraphX has the following enhancements:
The __AMDGCN_WAVEFRONT_SIZE and __AMDGCN_WAVEFRONT_SIZE__ macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using hipGetDeviceProperties in host code. Neither of these will result in a compile-time constant. For more information, see warpSize.
For cases where compile-time evaluation of the wavefront size cannot be avoided, uses of __AMDGCN_WAVEFRONT_SIZE or __AMDGCN_WAVEFRONT_SIZE__ can be replaced with a user-defined macro or constexpr variable with the wavefront size(s) for the target hardware. For example:
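```cpp
// gfx9 targets, which include the CDNA-based AMD Instinct GPUs, use 64-wide wavefronts.
#if defined(__GFX9__)
#define MY_MACRO_FOR_WAVEFRONT_SIZE 64
#else
// Other targets, such as RDNA-based GPUs, default to 32-wide wavefronts.
#define MY_MACRO_FOR_WAVEFRONT_SIZE 32
#endif
```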
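The same approach works with a constexpr variable, which the text above mentions as an alternative. In this sketch, `kWavefrontSize` and `WavefrontsPerBlock` are hypothetical user-defined names. Target macros such as __GFX9__ are expected to be defined only while compiling device code for matching targets, so host code (and any non-HIP compiler) sees the fallback value.

```cpp
// Hypothetical user-defined constant, following the guidance above.
// Target macros such as __GFX9__ are defined only while compiling device code
// for matching targets, so host code (and non-HIP compilers) sees the fallback.
#if defined(__GFX9__)
inline constexpr int kWavefrontSize = 64;
#else
inline constexpr int kWavefrontSize = 32;
#endif

// Compile-time use: the number of wavefronts in a block of BlockSize threads,
// for example to size a per-block array with one slot per wavefront.
template <int BlockSize>
struct WavefrontsPerBlock {
    static_assert(BlockSize % kWavefrontSize == 0,
                  "block size must be a multiple of the wavefront size");
    static constexpr int value = BlockSize / kWavefrontSize;
};
```

If you build RDNA targets in wave64 mode (for example, with -mwavefrontsize64), extend the condition accordingly. Host code that needs the wavefront size of the actual device should still query it at run time with hipGetDeviceProperties.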
AMD ROCm Simulation is an open-source toolkit on the ROCm platform for high-performance, physics-based and numerical simulation on AMD GPUs. It brings scientific computing, computer graphics, robotics, and AI-driven simulation to AMD Instinct GPUs by unifying the HIP runtime, optimized math libraries, and PyTorch integration for high-throughput real-time and offline workloads.
The libraries span physics kernels, numerical solvers, rendering, and multi-GPU scaling, with Python-friendly APIs that plug into existing research and production pipelines. By using ROCm’s open-source GPU stack on AMD Instinct products, you gain optimized performance, flexible integration with Python and machine learning frameworks, and scalability across multi-GPU clusters and high-performance computing (HPC) environments. For more information, see the ROCm Simulation documentation.
The release in December 2025 introduced support for ROCm 7.0.0 for the two components:
Taichi Lang is an open-source, imperative, parallel programming language for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks (such as LLVM) to offload the compute-intensive Python code to the native GPU or CPU instructions.
GSplat (Gaussian splatting) is a highly efficient technique for real-time rendering of 3D scenes trained from a collection of multiview 2D images of the scene. It has emerged as an alternative to neural radiance fields (NeRFs), offering significant advantages in rendering speed while maintaining visual quality.
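At render time, Gaussian splatting blends the depth-sorted Gaussians that cover each pixel from front to back. The sketch below shows that compositing step for one pixel and one color channel, assuming each splat's opacity at the pixel has already been evaluated. The `Splat` type, the function name, and the 1e-4 early-termination threshold are illustrative choices, not GSplat's API.

```cpp
#include <cassert>
#include <vector>

// Minimal sketch of front-to-back alpha compositing for one pixel.
// Colors are single-channel for brevity; all names are hypothetical.
struct Splat {
    float color;  // single-channel color of this Gaussian
    float alpha;  // opacity of this Gaussian at the pixel, in [0, 1]
};

// Splats must be sorted front to back. Computes
// C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).
inline float composite_front_to_back(const std::vector<Splat>& sorted_splats,
                                     float min_transmittance = 1e-4f) {
    float color = 0.0f;
    float transmittance = 1.0f;  // fraction of light not yet absorbed
    for (const Splat& s : sorted_splats) {
        assert(s.alpha >= 0.0f && s.alpha <= 1.0f);
        color += s.color * s.alpha * transmittance;
        transmittance *= 1.0f - s.alpha;
        // Early termination: later splats can barely change the pixel.
        if (transmittance < min_transmittance) {
            break;
        }
    }
    return color;
}
```

Early termination is one reason rasterizing splats is fast compared with sampling a NeRF along each ray: once a pixel is effectively opaque, the remaining Gaussians are skipped.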
ROCm Optiq (Beta) is AMD’s next‑generation visualization platform designed to bring clarity to performance analysis. You can use the ROCm Optiq GUI to view trace files captured with the ROCm Systems Profiler on any supported Microsoft Windows or Linux system.
With ROCm Optiq, developers can pinpoint performance bottlenecks — from pipeline stalls and memory bandwidth limitations to suboptimal kernel launches. ROCm Optiq delivers a comprehensive, end‑to‑end view of system behavior, empowering teams to optimize their workflows by correlating GPU workloads with in‑application CPU events and hardware resource utilization. For more information, see the ROCm Optiq documentation.
The AMD ROCm Life Science (ROCm-LS) toolkit is a GPU-accelerated library suite developed for life science and healthcare applications, offering a robust set of tools optimized for AMD hardware. In December 2025, ROCm-LS transitioned from early access (EA) to general availability (GA).
The ROCm-LS GA release is marked with the transition of hipCIM from EA to production-ready and support for ROCm 7.0.x. For more information, see ROCm-LS 25.11 release notes.
ROCm provides a comprehensive ecosystem for deep learning development. For more information, see Deep learning frameworks for ROCm and the Compatibility matrix for the complete list of Deep learning and AI framework versions tested for compatibility with ROCm. AMD ROCm has officially updated support for the following Deep learning and AI frameworks:
ROCm 7.2.0 enables support for JAX 0.8.0. For more information, see JAX compatibility.
ROCm 7.2.0 enables support for ONNX Runtime 1.23.2.
Volcano Engine Reinforcement Learning (verl) is a reinforcement learning framework designed for large language models (LLMs). verl offers a scalable, open-source fine-tuning solution by using a hybrid programming model that makes it easy to define and run complex post-training dataflows efficiently. It is now supported on ROCm 7.0.0 (previously only supported on ROCm 6.2.0). For more information, see verl compatibility.
Ray is a unified framework for scaling AI and Python applications from your laptop to a full cluster, without changing your code. Ray consists of a core distributed runtime and a set of AI libraries for simplifying machine learning computations. It is now supported on ROCm 7.0.0 (previously only supported on ROCm 6.4.1). For more information, see Ray compatibility.
The ROCm Offline Installer Creator 7.2.0 includes the following features and improvements:
See ROCm Offline Installer Creator for more information.
The ROCm Runfile Installer 7.2.0 includes fixes for rocm-examples test script build issues.
For more information, see ROCm Runfile Installer.
The ROCm examples repository has been expanded with examples for the following ROCm components:
Usage examples are now available for the ROCgdb debugger.
ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases.
The newest resource for ROCm and HIP developers is the AMD ROCm Programming Guide. This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The HIP documentation set continues to provide detailed information, tutorials, and reference content.
The HIP Programming Guide section includes a new topic titled “Understanding GPU performance”. It explains the theoretical foundations of GPU performance on AMD hardware. Understanding these concepts helps you analyze performance characteristics, identify bottlenecks, and make informed optimization decisions. Two other topics in this guide have been enhanced: Performance guidelines and Hardware implementation.
Tutorials for AI developers have been expanded with the following two new tutorials:
For more information about the changes, see the Changelog for the AI Developer Hub.
The following table lists the versions of ROCm components for ROCm 7.2.0, including any version changes from 7.1.1 to 7.2.0. Click the component's updated version to go to a list of its changes.
Click {fab}github to go to the component's source code on GitHub.
| Category | Group | Name | Version |
|---|---|---|---|
| Libraries | Machine learning and computer vision | Composable Kernel | 1.1.0 ⇒ 1.2.0 |
| | | MIGraphX | 2.14.0 ⇒ 2.15.0 |
| | | MIOpen | 3.5.1 ⇒ 3.5.1 |
| | | MIVisionX | 3.4.0 ⇒ 3.5.0 |
| | | rocAL | 2.4.0 ⇒ 2.5.0 |
| | | rocDecode | 1.4.0 ⇒ 1.5.0 |
| | | rocJPEG | 1.2.0 ⇒ 1.3.0 |
| | | rocPyDecode | 0.7.0 ⇒ 0.8.0 |
| | | RPP | 2.1.0 ⇒ 2.2.0 |
| | Communication | RCCL | 2.27.7 ⇒ 2.27.7 |
| | | rocSHMEM | 3.1.0 ⇒ 3.2.0 |
| | Math | hipBLAS | 3.1.0 ⇒ 3.2.0 |
| | | hipBLASLt | 1.1.0 ⇒ 1.2.1 |
| | | hipFFT | 1.0.21 ⇒ 1.0.22 |
| | Primitives | hipCUB | 4.1.0 ⇒ 4.2.0 |
| | | hipTensor | 2.0.0 ⇒ 2.2.0 |
| | | rocPRIM | 4.1.0 ⇒ 4.2.0 |
| Tools | System management | AMD SMI | 26.2.0 ⇒ 26.2.1 |
| | | ROCm Data Center Tool | 1.2.0 |
| | | rocminfo | 1.0.0 |
| | | ROCm SMI | 7.8.0 |
| | Performance | ROCm Bandwidth Test | 2.6.0 ⇒ 2.6.0 |
| | | ROCm Compute Profiler | 3.3.1 ⇒ 3.4.0 |
| | | ROCm Systems Profiler | 1.2.1 ⇒ 1.3.0 |
| | Development | HIPIFY | 20.0.0 ⇒ 22.0.0 |
| | | ROCdbgapi | 0.77.4 |
| | | ROCm CMake | 0.14.0 |
| | | ROCm Debugger (ROCgdb) | 16.3 |
| Compilers | | HIPCC | 1.1.1 |
| | | llvm-project | 20.0.0 ⇒ 22.0.0 |
| Runtimes | | HIP | 7.1.1 ⇒ 7.2.0 |
| | | ROCr Runtime | 1.18.0 |
The following sections describe key changes to ROCm components.
For a historical overview of ROCm component updates, see the {doc}`ROCm consolidated changelog </release/changelog>`.
- amd-smi monitor CLI.
  - amd-smi monitor --gpu-board-temps for GPU board temperature sensors.
  - amd-smi monitor --base-board-temps for base board temperature sensors.

(amdsmi-npm-changelog)=
- New Node Power Management (NPM) APIs and CLI options for node monitoring.
  - amdsmi_get_node_handle() gets the handle for a node device.
  - amdsmi_get_npm_info() retrieves Node Power Management information.
  - amdsmi_npm_status_t indicates whether NPM is enabled or disabled.
  - amdsmi_npm_info_t contains the status and node-level power limit in watts.
  - amd-smi node subcommand for NPM operations via CLI.
  - OAM_ID 0 only.
- The following C APIs are added to amdsmi_interface.py:
  - amdsmi_get_cpu_handle()
  - amdsmi_get_esmi_err_msg()
  - amdsmi_get_gpu_event_notification()
  - amdsmi_get_processor_count_from_handles()
  - amdsmi_get_processor_handles_by_type()
  - amdsmi_gpu_validate_ras_eeprom()
  - amdsmi_init_gpu_event_notification()
  - amdsmi_set_gpu_event_notification_mask()
  - amdsmi_stop_gpu_event_notification()
  - amdsmi_get_gpu_busy_percent()
- Additional return value to amdsmi_get_xgmi_plpd() API:
  - policies is added to the end of the dictionary to match the API definition.
  - plpds is marked for deprecation as it has the same information as policies.
- PCIe levels to amd-smi static --bus command.
  - The --bus option has been updated to include the range of PCIe levels that you can set for a device.
- evicted_time metric for KFD processes.
  - amd-smi monitor -q and amd-smi process.
  - amdsmi_get_gpu_process_list(), amdsmi_get_gpu_compute_process_info(), and amdsmi_get_gpu_compute_process_info_by_pid().
- New VRAM types to amdsmi_vram_type_t.
  - amd-smi static --vram and amdsmi_get_gpu_vram_info() now support the following types: DDR5, LPDDR4, LPDDR5, and HBM3E.
- Support for PPT1 power limit information.
  - amdsmi_get_supported_power_cap() returns the power cap types supported on the device (PPT0, PPT1), so you can determine which power cap types you can get or set.
  - amdsmi_get_power_cap_info() and amdsmi_set_power_cap().
  - set and static commands regarding support for PPT1.
- The amd-smi command now shows hsmp rather than amd_hsmp.
  - The hsmp driver version can be shown without the amdgpu version using amd-smi version -c.
- The amd-smi set --power-cap command now requires specification of the power cap type.
  - amd-smi set --power-cap <power-cap-type> <new-cap>.
- The amd-smi reset --power-cap command now attempts to reset both PPT0 and PPT1 power caps to their default values. If a device only has PPT0, then only PPT0 is reset.
- The amd-smi static --limit command now includes a PPT1 section with PPT1 power limit information when PPT1 is available on the device.
- amdsmi_get_gpu_od_volt_info() returned a reference to a Python object. The returned dictionary was changed to return values in all fields.
- pk_int4_t in the CK Tile weight preshuffle GEMM.
- grouped_gemm kernels to perform multi_d elementwise operation.
- f32 to FMHA (fwd/bwd).
- Arch, to make_kernel to support linking multiple object files that have the same kernel compiled for different architectures.
- BlockSize in make_kernel and CShuffleEpilogueProblem to support Wave32 in CK Tile.
- hipLibraryEnumerateKernels returns kernel handles within a library.
- hipKernelGetLibrary returns the library handle for a hipKernel_t handle.
- hipKernelGetName returns the function name for a hipKernel_t handle.
- hipLibraryLoadData creates a library object from code.
- hipLibraryLoadFromFile creates a library object from a file.
- hipLibraryUnload unloads a library.
- hipLibraryGetKernel gets a kernel from the library.
- hipLibraryGetKernelCount gets the kernel count in a library.
- hipStreamCopyAttributes copies attributes from the source stream to the destination stream.
- hipOccupancyAvailableDynamicSMemPerBlock returns the dynamic shared memory available per block when launching numBlocks blocks on a CU.
- hipMemLocationTypeHost enables virtual memory management in host memory locations, in addition to device memory.
- hipGetProcAddress enables searching for the per-thread version symbols:
  - HIP_GET_PROC_ADDRESS_DEFAULT
  - HIP_GET_PROC_ADDRESS_LEGACY_STREAM
  - HIP_GET_PROC_ADDRESS_PER_THREAD_DEFAULT_STREAM
- HIPBLAS_CLIENT_RAM_GB_LIMIT environment variable.
- BF16 input data type with an FP32 output data type for gfx90a.
- HIPBLASLT_OVERRIDE_COMPUTE_TYPE_XF32 to override the compute type from xf32 to other compute types.
- HIPBLAS_STATUS_INTERNAL_ERROR issue that could occur with various sizes in CPX mode.
- hipFFTW execution functions, where input and output data buffers differ from the buffers specified at plan creation:
- hipFFTw support.
- --local-headers to enable hipification of quoted local headers (non-recursive).
- --local-headers-recursive to enable hipification of quoted local headers recursively.
- cuda_bf16.h import in hipification.
- ROCSOLVER_LEVELS and ROCSOLVER_LAYER.
- --clients-only option to the install.sh and rmake.py scripts for building only the clients when using a version of hipSPARSE that is already installed.
- hipsparseCreate functions.
- FP16 and FP8 (E4M3) data types.
- hiptensorHandleWritePlanCacheToFile to write the plan cache of a hipTensor handle to a file.
- hiptensorHandleReadPlanCacheFromFile to read a plan cache from a file into a hipTensor handle.
- simple_contraction_plan_cache to demonstrate plan cache usage.
- plan_cache_test to test the plan cache across various tensor ranks.
- github rocm-libraries. This repository consolidates a number of separate ROCm libraries and shared components.
- hiptensor/hiptensor.hpp and hiptensor/hiptensor_types.hpp are now deprecated. Use hiptensor/hiptensor.h and hiptensor/hiptensor_types.h instead.
- -foffload-lto=thin. For more information, see ROCm compiler reference.
- DepthToSpace Op.
- bias and key_mask_padding inputs for the MultiHeadAttention operator.
- dim_params input parameter to the parse_onnx Python call.
- get_onnx_operators().
- --input-dim instead of --batch to set any dynamic dimensions when using migraphx-driver.
- if branches.
- GroupQueryAttention.
- PipelineRepoRef parameter in CI.
- LRN operator to an optimized pooling operator.
- find_matches function.
- split_reduce.
- pointwise: Wrong number of arguments error when quantizing certain models to int8.
- simplify_reshapes.
- rbuild installation instructions to use Python venv to avoid a warning.
- ${ROCM_PATH}/lib/llvm/bin.
- rocDecode and rocJPEG support for hardware decode.
- NCCL_DEBUG=NONE.
- reduceCopyPacks pipelining for gfx950.
- EnumRegistry to register all the enums present in rocAL.
- Argument class, which stores the value and type of each argument in the Node.
- PipelineOperator class to represent operators in the pipeline with metadata.
- ${ROCM_PATH}/lib/llvm/bin.
- ResizeScalingMode, ResizeInterpolationType, MelScaleFormula, AudioBorderType, and OutOfBoundsPolicy in commons.h.
- crop_w and crop_h values were not correctly updated.
- TurboJPEG.
- --clients-only option to the install.sh and rmake.py scripts to allow building only the clients while using an already installed version of rocALUTION.
- syrk_ex function for both C and FORTRAN, without API support for the ILP64 format.
- tpmv and sbmv functions.
- ROCBLAS_CLIENT_RAM_GB_LIMIT environment variable.
- rocdecode-host package must be installed to use the FFmpeg decoder.
- libdrm path configuration and libva version requirements for ROCm and TheRock platforms.
- libdrm path configuration and libva version requirements for ROCm and TheRock platforms.
- libva-devel instead of libva-amdgpu/libva-amdgpu-devel.
- ${ROCM_PATH}/lib/llvm/bin location.
- rocm-bandwidth-test folder is no longer present after driver uninstallation.
- --list-blocks <arch> option to general options. It lists the available IP blocks on the specified arch (similar to --list-metrics). However, it cannot be used with --block.
- config_delta/gfx950_diff.yaml to analysis config YAMLs to track the differences between the gfx9xx GPUs and the latest supported gfx950 GPUs.
- Analysis db features
- AMDGPU driver info and GPU VRAM attributes in the system info section of the analysis report.
- CU Utilization metric to display the percentage of CUs utilized during kernel execution.
- -b/--block accepts block alias(es). See block aliases using the command-line option --list-blocks <arch>.
- Analysis config YAMLs are now managed with the new config management workflow in tools/config_management/.
- The amdsmi Python API is used instead of the amd-smi CLI to query GPU specifications.
- Empty cells are replaced with N/A for unavailable metrics in analysis.
- database mode from ROCm Compute Profiler, in favor of visualization methods other than Grafana and MongoDB integration, such as the upcoming Analysis DB-based Visualizer.
- N/A in memory chart diagram.
- Active CUs metric has been deprecated in favor of CU Utilization and will be removed in a future release.
- ROCPROFSYS_PERFETTO_FLUSH_PERIOD_MS configuration setting to set the flush period for Perfetto traces. The default value is 10000 ms (10 seconds).
- rocpd schema from rocprofiler-sdk-rocpd.
- rocprof-sys-instrument uses the Fortran program main function instead of the C wrapper.
- rocprof-sys-python with ROCPROFSYS_USE_ROCPD enabled.
- rocpd output.
- BENCHMARK_USE_AMDSMI. It is set to OFF by default. When this option is set to ON, it lets benchmarks use AMD SMI to output more GPU statistics.
- device_search.
- apply_config_improvements.py file, which generates improved configs by taking the best specializations from old and new configs. Use --help for usage instructions, and see rocPRIM Performance Tuning for more information.
- device_radix_sort onesweep variant.
- rocprim::device_scan_by_key failed when performing an "in-place" inclusive scan by reusing "keys" as output, by adding a buffer to store the last keys of each block (excluding the last block). This fix only affects the specific case of reusing "keys" as output in an inclusive scan, and does not affect other cases.
- float_bit_mask for rocprim::half.
- __builtin_clz, __builtin_ctz, and similar builtins are called.
- rocprim::detail::histogram_impl.
- rocprim::partition_threeway with large input data sizes on later ROCm builds. A workaround is currently in place.
- hipStreamCopyAttributes API implementation.
- ${ROCM_PATH}/lib/llvm/bin for AMD Clang.
- -DUSE_SYSTEM_LIB to allow tests to be built from ROCm libraries provided by the system.
- launch method in host_system and device_system, so that kernels with all supported arches can be compiled with the correct configuration during the host pass. All generators are updated accordingly to support SPIR-V. To use SPIR-V, build with -DAMDGPU_TARGETS=amdgcnspirv.
- mrg31k3p_state, mrg32k3a_state, xorwow_state, and philox4x32_10_state states no longer use the boxmuller_float_state and boxmuller_double_state states, and the boxmuller_float and boxmuller_double variables are set with NaN as default values.
- rocshmem_p
- rocshmem_<TYPE>_fetch_<op>
- rocshmem_<TYPE>_atomic_{and,or,xor,swap}
- rocshmem_<TYPE>_atomic_fetch_{and,or,xor,swap}
- rocsparse_spmv routine.
- rocsparse_sptrsv and rocsparse_sptrsm routines for triangular solve.
- --clients-only option to the install.sh and rmake.py scripts to only build the clients for a version of rocSPARSE that is already installed.
- rocsparse_spmv_alg_csr_nnzsplit to rocsparse_spmv. This algorithm might be superior to the existing adaptive algorithm rocsparse_spmv_alg_csr_adaptive when running the computation a small number of times, because it avoids paying the analysis cost of the adaptive algorithm.
- --no-rocblas option with the install.sh or rmake.py build scripts.
- rocsparse_sddmm routine when using CSR format, especially as the number of columns in the dense A matrix (or rows in the dense B matrix) increases.
- rmake.py build script to properly handle auto and all options when selecting offload targets.
- std::fma casting in host routines to properly deduce types. This could previously cause compilation failures when building from source.
- thrust::unique_ptr, a smart pointer for managing device memory with automatic cleanup.
- BUILD_OFFLOAD_COMPRESS. When rocThrust is built with this option enabled, the --offload-compress switch is passed to the compiler, which then compresses the binary that it generates. Compression can be useful when compiling for a large number of targets, because this often results in a large binary. Without compression, in some cases, the generated binary may become so large that symbols are placed out of range, resulting in linking errors. The new BUILD_OFFLOAD_COMPRESS option is set to ON by default.
- perf_i8gemm to demonstrate int8_t as the matrix input data type.
- github rocm-libraries. This repository consolidates a number of separate ROCm libraries and shared components.
- ${ROCM_PATH}/lib/llvm/bin.
- copy_param_float() and copy_param_uint() memory copy helper functions have been removed, as buffers now consistently use pinned/HIP memory.

ROCm known issues are noted on {fab}github GitHub. For known issues related to individual components, review the Detailed component changes.
Installing multiple versions of ROCm on the same system might result in the amd-smi CLI functioning incorrectly.
As a workaround, use one of the following options:

Option 1: If you need only the CLI or the C++ library, uninstall the amdsmi Python package:

```shell
python3 -m pip uninstall amdsmi
```

Option 2: Reinstall the Python library from your target ROCm version:

```shell
# Remove the previous installation
python3 -m pip uninstall amdsmi
# Install from the target ROCm instance
cd /opt/rocm/share/amd_smi
python3 -m pip install --user .
```

You might need `sudo`. If the pip uninstall or install fails, add the `--break-system-packages` flag.
For detailed instructions, see Install the Python library for multiple ROCm instances. The issue will be fixed in a future ROCm release. See GitHub issue #5875.
You might experience intermittent errors or segmentation faults when running JAX workloads. The issue is currently under investigation and will be addressed in an upcoming ROCm release. See GitHub issue #5878.
If you’re using hipBLASLt on AMD Instinct MI325X GPUs for large FP8 GEMM operations (such as 9728x8192x65536), you might observe a noticeable performance variation. The issue is currently under investigation and will be fixed in a future ROCm release. See GitHub issue #5734.
The following are previously known issues resolved in this release. For resolved issues related to individual components, review the Detailed component changes.
The RCCL performance degradation issue affecting AMD Instinct MI300X GPUs with AMD Pollara AI NIC for specific collectives and message sizes has been resolved. The impacted collectives included Scatter, AllToAll, and AlltoAllv. See GitHub issue #5717.
The issue where the rocprofv3 tool failed on RPM-based operating systems (such as RHEL 8) with Python 3.10 (and later) due to missing ROCPD bindings has been resolved. See GitHub issue #5606.
An issue where applications using OpenCV packages failed due to package incompatibility between OpenCV built on Ubuntu 24.04 and Debian 13 has been resolved. See GitHub issue #5501.
An issue where running the amd-smi CLI on GPUs with partitioning support, such as the AMD Instinct MI300 Series, produced repeated kernel error messages in the system logs has been resolved. The issue occurred when amd-smi attempted to open invalid partition device nodes during device permission checks. As a result, the AMD GPU Driver (amdgpu) logged errors in dmesg, such as:

```text
amdgpu 0000:15:00.0: amdgpu: renderD153 partition 1 not valid!
```
These repeated kernel logs could clutter the system logs and cause unnecessary concern about GPU health. See GitHub issue #5720.
An issue where some gemm_ex operations with 8-bit input data types (int8, float8, bfloat8) yielded incorrect results for specific matrix dimensions (K = 1 and number of workgroups > 1) has been resolved. The root cause was incorrect tail-loop code that ignored the workgroup index when calculating the valid element size. See GitHub issue #5722.
The following changes to the ROCm software stack are anticipated for future releases.
The ROCm Offline Installer Creator is deprecated with the ROCm 7.2.0 release. Equivalent installation capabilities are available through the ROCm Runfile Installer, a self-extracting installer that is not based on OS package managers. The ROCm Offline Installer Creator will be removed in a future release.
ROCm SMI will be phased out in an upcoming ROCm release and will enter maintenance mode. After this transition, only critical bug fixes will be addressed and no further feature development will take place.
It's strongly recommended to transition your projects to AMD SMI, the successor to ROCm SMI. AMD SMI includes all the features of the ROCm SMI and will continue to receive regular updates, new functionality, and ongoing support. For more information on AMD SMI, see the AMD SMI documentation.
ROCTracer, ROCProfiler, rocprof, and rocprofv2 are deprecated and only critical defect fixes will be addressed for older versions of the profiling tools and libraries. It's strongly recommended to upgrade to the latest version of the ROCprofiler-SDK library and the rocprofv3 tool to ensure continued support and access to new features.
It's anticipated that ROCTracer, ROCProfiler, rocprof, and rocprofv2 will reach end-of-life in a future release, around Q1 2026.
ROCm Object Tooling tools roc-obj-ls, roc-obj-extract, and roc-obj were
deprecated in ROCm 6.4, and will be removed in a future release. Functionality
has been added to the llvm-objdump --offloading tool option to extract all
clang-offload-bundles into individual code objects found within the objects
or executables passed as input. The llvm-objdump --offloading tool option also
supports the --arch-name option, and only extracts code objects found with
the specified target architecture. See llvm-objdump
for more information.
For detailed installation instructions, refer to ROCm installation on Linux. ROCm binaries for installation are located at repo.radeon.com and listed below:
AMD GPU Driver (amdgpu):
ROCm:
ROCm Offline Installer Creator: repo.radeon.com/rocm/installer/rocm-linux-install-offline/rocm-rel-7.2/
ROCm Runfile Installer: repo.radeon.com/rocm/installer/rocm-runfile-installer/rocm-rel-7.2
| Name | Version |
|---|---|
| hipfort | 0.7.1 |
| hipRAND | 3.1.0 |
| hipSOLVER | 3.1.0 ⇒ 3.2.0 |
| hipSPARSE | 4.1.0 ⇒ 4.2.0 |
| hipSPARSELt | 0.2.5 ⇒ 0.2.6 |
| rocALUTION | 4.0.1 ⇒ 4.1.0 |
| rocBLAS | 5.1.1 ⇒ 5.2.0 |
| rocFFT | 1.0.35 ⇒ 1.0.36 |
| rocRAND | 4.1.0 ⇒ 4.2.0 |
| rocSOLVER | 3.31.0 ⇒ 3.32.0 |
| rocSPARSE | 4.1.0 ⇒ 4.2.0 |
| rocWMMA | 2.1.0 ⇒ 2.2.0 |
| Tensile | 4.44.0 |
| rocThrust | 4.1.0 ⇒ 4.2.0 |
| ROCm Validation Suite | 1.3.0 |
| ROCProfiler | 2.0.0 |
| ROCprofiler-SDK | 1.0.0 ⇒ 1.1.0 |
| ROCTracer | 4.1.0 |
| ROCr Debug Agent | 2.1.0 |