source: pytorch blog: pytorch 2.12 release blog
level: technical
pytorch 2.12 includes a major speedup for batched eigenvalue decomposition on cuda. torch.linalg.eigh now dispatches to cusolver's batched jacobi kernel (syevjBatched), which decomposes many small or medium matrices in a single gpu launch instead of handling them one at a time. this change can make workloads that previously took minutes run in seconds. it mainly benefits scientific computing and machine learning tasks that rely on eigendecompositions of large batches of matrices.
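for context, a minimal sketch of what the batched call looks like from the user's side; the batch and matrix sizes below are arbitrary, and nothing in user code changes since the kernel selection happens inside pytorch:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# illustrative batch: 4096 symmetric 32x32 matrices (sizes chosen arbitrarily)
a = torch.randn(4096, 32, 32, device=device)
a = a + a.transpose(-1, -2)  # symmetrize so eigh's assumptions hold

# one call decomposes the whole batch; per the blog, on cuda 2.12 routes
# batches of small/medium matrices through cusolver's batched jacobi solver
eigenvalues, eigenvectors = torch.linalg.eigh(a)
print(eigenvalues.shape, eigenvectors.shape)  # (4096, 32) and (4096, 32, 32)
```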
a new torch.accelerator.graph api provides a unified way to capture and replay graphs across different hardware backends, including cuda, xpu, and out-of-tree devices. this replaces backend-specific implementations with a consistent interface. torch.export now supports microscaling (mx) quantization formats, allowing models with aggressive compression to be fully exported and deployed. the adagrad optimizer also gains a fused option, performing the entire optimizer step in a single cuda kernel to reduce overhead.
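a minimal sketch of opting into the fused optimizer step; the fused flag already exists on torch.optim.Adagrad, and per the blog the single-kernel cuda path behind it is what 2.12 adds (the model, shapes, and learning rate here are placeholders):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)

# fused=True requests the single-kernel optimizer step described in the blog;
# the flag is part of the existing adagrad constructor, the cuda kernel is the new part
opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)

x = torch.randn(32, 128, device=device)
loss = model(x).sum()
loss.backward()
opt.step()       # with fused=True this is one kernel launch on cuda
opt.zero_grad()
```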
other improvements include the ability to capture torch.cond control flow inside cuda graphs, using conditional (if) nodes so the branch is evaluated on the gpu. distributed training sees better profiling with flow ids and sequence numbers for nccl collectives, plus flight recorder support for more backends. rocm users get expandable memory segments, rocshmem symmetric memory collectives, and flexattention pipelining for faster attention. apple mps now ships with precompiled metal 4 shaders to cut startup latency.
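for reference, a minimal sketch of the torch.cond op itself in eager mode; capturing a call like this inside a cuda graph, with the branch selected on the gpu via a conditional node, is the new 2.12 behavior and is not shown here:

```python
import torch

def true_fn(x):
    return torch.cos(x)

def false_fn(x):
    return torch.sin(x)

x = torch.randn(8)
# torch.cond chooses a branch from a boolean tensor predicate; both branches
# must accept and return tensors of matching shape and dtype
out = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```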
why it matters: these updates make pytorch faster and more portable across hardware, simplifying deployment of compressed models and improving distributed training debugging.