source: kdnuggets: using polars instead of pandas: performance deep dive

level: technical

pandas has been the standard for data work in python for years, but it slows down once datasets reach millions of rows: groupby operations take seconds, intermediate copies consume ram, and, by the article's account, window functions run as python loops. polars is a dataframe library written in rust on top of the apache arrow memory format. it supports lazy evaluation and runs operations in parallel across cpu cores. the article compares both libraries on three data problems from stratascratch.

the first problem ranks users by email count. pandas uses groupby and rank with method='first', which allocates a rank array and breaks ties by order of appearance. polars sorts and uses with_row_count (renamed with_row_index in recent polars releases) to assign sequential numbers, avoiding rank entirely. the article reports polars running 5–10x faster on millions of records. the second problem finds users who made a second purchase 1–7 days after their first. pandas creates multiple dataframes, pivots, and computes date differences. polars uses a window expression with over and filters in one lazy chain, which the article says uses less memory and runs faster.

the third problem computes a cumulative average of monthly sales. pandas uses expanding().mean(), which the article describes as looping over growing windows. the polars solution aggregates sales by month in rust, collects the result, and then applies numpy's cumsum, so the running total is computed in a single pass. for small data the difference is small, but for longer time series the polars approach scales better. the results can also round differently: pandas rounds half to even (banker's rounding), while the polars solution uses floor(x + 0.5) to round half up.
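a small sketch of the cumulative average and the rounding difference, assuming the monthly totals are already aggregated. the three values are hypothetical, chosen so that one average lands exactly on .5; the aggregation step itself is left out.

```python
import numpy as np
import pandas as pd

# hypothetical monthly totals, one value per month
monthly_sales = pd.Series([2.0, 3.0, 7.0], index=["2024-01", "2024-02", "2024-03"])

# pandas: mean over an expanding window
pandas_avg = monthly_sales.expanding().mean()

# numpy: running total divided by the number of months seen so far
values = monthly_sales.to_numpy()
numpy_avg = np.cumsum(values) / np.arange(1, len(values) + 1)

# Series.round rounds half to even (banker's rounding),
# floor(x + 0.5) rounds half up
rounded_half_even = pandas_avg.round(0)
rounded_half_up = np.floor(numpy_avg + 0.5)
```

both methods give cumulative averages of 2.0, 2.5 and 4.0. the second value is where they part ways: round-half-to-even yields 2.0, while floor(x + 0.5) yields 3.0.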

why it matters: switching to polars can significantly speed up data processing and reduce memory use for large datasets, making it a practical choice for data science workflows.

