source: hugging face blog: profiling in pytorch (part 1): a beginner's guide to torch.profiler

level: technical

profiling is key to optimizing pytorch code, but traces can be hard to read. this post starts a series that teaches profiling from scratch. it uses a simple function: a matrix multiplication followed by a bias add. the profiler gives two outputs: a table with timing stats and a trace showing when cpu and gpu events happen. the table helps spot hotspots, while the trace shows the dispatch chain and idle gaps.
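
the setup described above can be sketched as follows. this is a minimal illustration, not the post's exact code: the function name `matmul_bias`, the 64x64 size, the sort key and the trace filename are assumptions chosen for the example. it falls back to cpu-only profiling when no gpu is present.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def matmul_bias(x, w, b):
    # the function under test: a matrix multiplication followed by a bias add
    return x @ w + b


# use the gpu when one is available, otherwise profile on cpu only
device = "cuda" if torch.cuda.is_available() else "cpu"
n = 64  # assumed size for this sketch
x = torch.randn(n, n, device=device)
w = torch.randn(n, n, device=device)
b = torch.randn(n, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    out = matmul_bias(x, w, b)
    if device == "cuda":
        # wait for queued kernels so they fall inside the profiled region
        torch.cuda.synchronize()

# output 1: a table of aggregated timing stats, useful for spotting hotspots
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)

# output 2: a timeline trace that can be opened in a trace viewer
# (the filename is hypothetical)
prof.export_chrome_trace("matmul_bias_trace.json")
```

the table rows are named after the operations the profiler recorded (for example entries prefixed with `aten::`), which is the same naming that appears along the dispatch chain in the trace.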

running on small 64x64 matrices shows the gpu is mostly idle: cpu time is measured in milliseconds, while gpu time is in microseconds. this is an overhead-bound regime, where the gpu finishes each kernel quickly and most of the time goes to the cpu launching kernels. increasing the matrix size to 4096x4096 shifts the workload toward compute-bound, with gpu time now comparable to cpu time. the trace also reveals a long delay in the first profiler step, caused by one-time setup such as memory allocation. adding warmup steps to the profiler schedule keeps this one-time cost out of the recorded steps.
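
a rough back-of-the-envelope count helps explain the shift between the two regimes. the sketch below assumes the usual approximation of about 2·n³ floating-point operations for an n x n by n x n matrix multiplication; that figure is a standard estimate, not a number taken from the post, and the helper name `matmul_flops` is hypothetical.

```python
def matmul_flops(n: int) -> int:
    # each of the n*n output elements is a dot product of length n,
    # counted here as roughly n multiplies plus n adds
    return 2 * n ** 3


small = matmul_flops(64)
large = matmul_flops(4096)

# prints: 524288 137438953472 262144
print(small, large, large // small)
```

the 4096x4096 case carries 262144 times more arithmetic than the 64x64 case, while the number of kernel launches issued by the cpu stays the same. that is why the small case is dominated by launch overhead and the large case by gpu compute.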

the trace also shows a 2.5 ms offset between the cpu and gpu lanes. it is caused by a buffer request on the gpu, which appears as a gap between kernels. changing the profiler schedule confirms that the offset occurs only once. the dispatch chain is visible in the trace: from the python call down to aten operations and finally cuda kernels. understanding this chain is essential for later optimization with tools like torch.compile.
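
a profiler schedule with warmup steps might look like the sketch below. the values `wait=1, warmup=1, active=2, repeat=1`, the six-step loop and the callback name `on_ready` are assumptions for illustration, not the settings used in the post. the example profiles cpu activity only so it runs anywhere; on a gpu machine `ProfilerActivity.CUDA` would be added to the activities list.

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# assumed schedule: skip 1 step, warm up for 1 step, record 2 steps, one cycle
sched = schedule(wait=1, warmup=1, active=2, repeat=1)

# the schedule is a callable that maps a step index to a profiler action,
# so the phases can be inspected without running the profiler
phases = [sched(step).name for step in range(6)]
print(phases)

x = torch.randn(64, 64)
w = torch.randn(64, 64)
b = torch.randn(64)

trace_calls = []


def on_ready(prof):
    # called each time a recording cycle finishes; here it only notes the call
    trace_calls.append(prof.step_num)


with profile(
    activities=[ProfilerActivity.CPU],
    schedule=sched,
    on_trace_ready=on_ready,
) as prof:
    for _ in range(6):
        y = x @ w + b
        prof.step()  # tell the profiler that one step has finished

print(len(trace_calls))
```

with these values only steps 2 and 3 are recorded, so any one-time setup that happens in steps 0 and 1 stays out of the saved trace. shifting the `wait` and `warmup` values is one way to check whether an effect seen in the trace repeats or occurs only once.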

why it matters: profiling helps data scientists and ml engineers find and fix performance bottlenecks in pytorch models, leading to faster training and inference.

