source: kdnuggets: 3 numpy tricks for numerical performance

level: technical

many data scientists write numpy code that runs slowly because they use python loops or create unnecessary array copies. numpy's speed comes from compiled c code that operates on contiguous memory blocks. to get the best performance, you need to understand how numpy handles computation and memory. three key techniques are vectorization with broadcasting, in-place operations, and using memory views instead of copies.

vectorization replaces explicit python loops with operations on whole arrays. for example, standardizing the columns of a 50,000 by 1,000 matrix with a nested loop took about 11 seconds. using broadcasting to subtract the column means and divide by the column standard deviations took about 0.2 seconds, roughly a 56x speedup in the reported benchmark. broadcasting lines up arrays of different shapes without materializing expanded copies of the smaller one, and the arithmetic itself runs in compiled c. avoid np.vectorize when speed is the goal: it is a convenience wrapper that still runs a python loop internally.
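
the column standardization described above can be sketched as follows. this is a minimal illustration, not the article's benchmark code: the array is a small random stand-in for the 50,000 by 1,000 matrix, and the names standardize_loop and standardize_broadcast are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# small stand-in for the large matrix, so the loop version finishes quickly
X = rng.normal(loc=5.0, scale=2.0, size=(200, 8))

def standardize_loop(a):
    """slow reference: explicit python loops over every element."""
    out = np.empty_like(a)
    n_rows, n_cols = a.shape
    for j in range(n_cols):
        mean = a[:, j].mean()
        std = a[:, j].std()
        for i in range(n_rows):
            out[i, j] = (a[i, j] - mean) / std
    return out

def standardize_broadcast(a):
    """vectorized: per-column stats of shape (n_cols,) broadcast across all rows."""
    mean = a.mean(axis=0)
    std = a.std(axis=0)
    return (a - mean) / std

Z_loop = standardize_loop(X)
Z_fast = standardize_broadcast(X)
```

both functions should return the same values up to floating-point rounding; the broadcast version simply moves the per-element work out of the python interpreter.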

in-place operations and the out parameter prevent temporary array allocations. a chained expression like y = 2 * x + 3 allocates a temporary array for 2 * x before producing the final result, which costs both memory and time. by pre-allocating an output array and passing it as out to functions like np.multiply and np.add, you can write each step directly into that buffer. this avoids repeated allocations and can help keep the working data in the cpu cache.

views avoid copies when selecting data. basic slicing returns a view that shares memory with the original array, while advanced indexing with integer lists or boolean masks returns a copy. creating a view for sub-sampling is nearly instant because no data is moved, but be careful: modifying a view also changes the original array.
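
the out parameter and the view versus copy distinction described above can be sketched together. this is an illustrative example under assumed array sizes and variable names, not code from the article.

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# chained form: allocates a temporary for 2 * x, then another array for the result
y_chained = 2 * x + 3

# buffer form: allocate once, then write every step into the same array
y = np.empty_like(x)
np.multiply(x, 2, out=y)   # y <- 2 * x
np.add(y, 3, out=y)        # y <- y + 3, in place

# basic slicing gives a view; indexing with a list gives a copy
a = np.arange(10)
view = a[::2]                # every other element, shares a's memory
copy = a[[0, 2, 4, 6, 8]]    # same values, but stored in a new buffer

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, copy)

view[0] = -1     # writes through to a[0]
copy[1] = -99    # leaves a unchanged
```

np.shares_memory is a convenient way to check whether an indexing expression produced a view before relying on (or guarding against) write-through behaviour.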

why it matters: these techniques can substantially cut both runtime and memory use in data processing pipelines, which matters most when working with large datasets in ai and data science.

