source: kdnuggets: stop writing loops in pandas: 7 faster alternatives to try
level: technical
row-by-row iteration is a common performance bottleneck in pandas. on small datasets the cost goes unnoticed, but on large ones it grows with every row and can dominate runtime. pandas is built on numpy, which runs operations on entire arrays in compiled c code. looping in python bypasses this and pushes the work back into the python interpreter one row at a time. this article covers seven alternatives to loops, each suited to a different kind of transformation.
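to make the contrast concrete, here is a minimal sketch that computes the same column twice, once with a row loop and once as a single column expression. the dataframe and its values are made up for illustration and are not from the article.

```python
import numpy as np
import pandas as pd

# hypothetical order data; the column names and values are illustrative
df = pd.DataFrame({
    "price": [10.0, 20.0, 5.5, 8.0],
    "quantity": [3, 1, 4, 2],
})

# slow pattern: one python-level iteration per row
loop_revenue = []
for _, row in df.iterrows():
    loop_revenue.append(row["price"] * row["quantity"])

# vectorized pattern: one expression over whole columns,
# evaluated by numpy in compiled code
vector_revenue = df["price"] * df["quantity"]

# both approaches produce the same numbers
assert np.allclose(loop_revenue, vector_revenue)
```

on four rows the difference is not measurable; the gap only shows up as the row count grows, so timing both versions on your own data is the reliable way to see it.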
for arithmetic on columns, use vectorized operations like df['revenue'] = df['price'] * df['quantity']. for conditional logic that needs a custom function, .apply() passes that function over a column, such as assigning a shipping label based on days to ship; it still calls the function once per element, so it is usually slower than the truly vectorized options. for binary conditions, np.where() is a fast vectorized choice, like flagging senior discounts. for multiple conditions, np.select() assigns values based on a list of conditions, such as region-based tax rates. for value substitution, .map() with a dictionary is clean and fast, like mapping product categories to department codes.
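the five techniques above can be sketched on one small dataframe. everything specific here is an assumption for illustration: the column names other than price, quantity and revenue, the shipping thresholds, the age cutoff of 65, the tax rates, and the department codes are hypothetical, not taken from the article.

```python
import numpy as np
import pandas as pd

# hypothetical data; names and values are illustrative
df = pd.DataFrame({
    "price": [10.0, 20.0, 5.5, 8.0],
    "quantity": [3, 1, 4, 2],
    "days_to_ship": [1, 4, 9, 2],
    "age": [70, 34, 65, 22],
    "region": ["north", "south", "east", "north"],
    "category": ["toys", "books", "toys", "garden"],
})

# 1. vectorized arithmetic on whole columns
df["revenue"] = df["price"] * df["quantity"]

# 2. .apply() with a custom function (thresholds are assumed)
def shipping_label(days):
    if days <= 2:
        return "fast"
    if days <= 5:
        return "standard"
    return "slow"

df["shipping"] = df["days_to_ship"].apply(shipping_label)

# 3. np.where() for a single yes/no condition (cutoff is assumed)
df["senior_discount"] = np.where(df["age"] >= 65, "yes", "no")

# 4. np.select() for several conditions; the first match wins,
#    and rows matching none of them get the default
conditions = [df["region"] == "north", df["region"] == "south"]
rates = [0.05, 0.07]
df["tax_rate"] = np.select(conditions, rates, default=0.10)

# 5. .map() with a dictionary lookup; unmapped keys would become NaN
dept_codes = {"toys": "D01", "books": "D02", "garden": "D03"}
df["dept"] = df["category"].map(dept_codes)

print(df)
```

the shipping label could also be written with np.select(); .apply() is shown here because it is the fallback when the logic does not reduce to a few boolean conditions.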
for string manipulation, the .str accessor runs operations across a column without loops, such as extracting the first word and lowercasing it. for aggregating groups, .groupby() with .agg() computes group-level statistics like total revenue and average ship days per category. choosing the right tool depends on the operation: vectorized math for arithmetic, np.where() for simple conditions, np.select() for multiple conditions, .map() for lookups, .apply() for custom functions, .str for strings, and .groupby() for aggregations.
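a short sketch of the last two techniques, again on made-up data: the product names, categories and numbers are hypothetical, and the output column names first_word, total_revenue and avg_ship_days are chosen here for illustration.

```python
import pandas as pd

# hypothetical data; names and values are illustrative
df = pd.DataFrame({
    "product": ["Wooden Train Set", "Garden Hose", "Story Book", "Toy Robot"],
    "category": ["toys", "garden", "books", "toys"],
    "revenue": [30.0, 20.0, 22.0, 16.0],
    "days_to_ship": [1, 4, 9, 3],
})

# .str accessor: split on whitespace, take the first word, lowercase it
df["first_word"] = df["product"].str.split().str[0].str.lower()

# .groupby() with named aggregation: one row per category
summary = df.groupby("category").agg(
    total_revenue=("revenue", "sum"),
    avg_ship_days=("days_to_ship", "mean"),
)

print(df[["product", "first_word"]])
print(summary)
```

named aggregation keeps the output column names explicit, which tends to be easier to read than passing a dictionary of functions and renaming afterwards.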
why it matters: using these methods avoids slow python loops and leverages pandas' optimized c-based operations, making data processing faster and more scalable for large datasets.