agentic workflows for data science pipelines

source: kdnuggets: 5 agentic workflows to automate your data science pipeline

level: technical

data scientists spend nearly half their time on data preparation and cleaning rather than on modeling or insight generation. tasks like profiling columns, running exploratory data analysis scripts, and grid-searching hyperparameters follow explicit rules, which makes them good candidates for automation with agents. agentic workflows absorb this procedural weight, so practitioners can focus on evaluative decisions such as checking whether a model makes sense and whether a feature is informative. platforms like databricks are integrating agentic data science capabilities to compress the time from question to insight.

the first workflow automates exploratory data analysis. an agent loads the dataset, profiles columns, flags issues by severity, and produces a structured markdown report. it runs a reasoning-and-acting loop, calling tools for profiling and flagging, then synthesizes the findings into a prioritized list. in a retail transaction example, the agent flagged revenue skew and null rates and recommended log transforms and feature parsing, all in under 30 seconds. a human then reviews the report and acts on its recommendations instead of running the diagnostics manually.
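the profile, flag, and report steps can be sketched in plain python. this is a minimal sketch under stated assumptions, not the article's implementation: the function names, the severity thresholds, and the sample rows are all hypothetical, and the llm-driven reasoning loop that would decide when to call these tools is left out.

```python
import statistics

# Hypothetical severity ranking used to order the report.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def profile_column(name, values):
    """Profile one column: null rate, plus skewness if it is fully numeric."""
    present = [v for v in values if v is not None]
    null_rate = 1 - len(present) / len(values) if values else 0.0
    profile = {"column": name, "null_rate": null_rate, "skew": None}
    numeric = [v for v in present if isinstance(v, (int, float))]
    if len(numeric) >= 3 and len(numeric) == len(present):
        mean = statistics.fmean(numeric)
        sd = statistics.pstdev(numeric)
        if sd > 0:
            # Population skewness: the mean of cubed z-scores.
            profile["skew"] = sum(((v - mean) / sd) ** 3 for v in numeric) / len(numeric)
    return profile


def flag_issues(profile, null_warn=0.05, null_high=0.30, skew_high=2.0):
    """Turn a profile into (severity, message) flags. Thresholds are assumptions."""
    flags = []
    if profile["null_rate"] >= null_high:
        flags.append(("high", f"{profile['null_rate']:.0%} of values are missing"))
    elif profile["null_rate"] >= null_warn:
        flags.append(("medium", f"{profile['null_rate']:.0%} of values are missing"))
    if profile["skew"] is not None and abs(profile["skew"]) >= skew_high:
        flags.append(("medium", f"skewness {profile['skew']:.1f}; consider a log transform"))
    return flags


def build_report(rows):
    """Profile every column and return a markdown report, worst issues first."""
    findings = []
    for col in rows[0].keys():
        profile = profile_column(col, [row.get(col) for row in rows])
        for severity, message in flag_issues(profile):
            findings.append((severity, col, message))
    findings.sort(key=lambda f: SEVERITY_ORDER[f[0]])
    lines = ["# eda report", ""]
    lines += [f"- [{sev}] `{col}`: {msg}" for sev, col, msg in findings]
    if not findings:
        lines.append("- no issues flagged")
    return "\n".join(lines)


# Made-up retail-style rows: one revenue outlier, and a sparse coupon column.
revenues = [10, 12, 11, 9, 13, 10, 12, 11, 10, 5000]
coupons = ["A", None, "B", None, "A", None, "C", None, "B", "A"]
rows = [{"revenue": r, "coupon": c} for r, c in zip(revenues, coupons)]

report = build_report(rows)
print(report)
```

keeping the tools deterministic and leaving only the prioritization and wording to the agent is one reasonable design choice here, since it lets a reviewer trace every flag in the report back to a concrete threshold.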

the second workflow handles feature engineering and selection. the agent proposes candidate features based on the data profile and domain context, generates transformation code, evaluates each candidate against a lightgbm baseline with cross-validation, and uses shap values to prune features that fall below an importance threshold. it explains its selection decisions in plain language. in a churn prediction scenario, the agent generated and tested features such as interaction terms, kept only those that improved model performance, and gave a rationale for each pruning decision.
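the keep-or-prune loop with a recorded rationale can be sketched as follows. this is a simplified sketch, not the article's method: it swaps the shap importance threshold for a simpler rule (keep a candidate only if it raises the cross-validated score by at least `min_gain`), and it takes the scorer as a parameter rather than calling lightgbm. `select_features`, `toy_cv_score`, the feature names, and the gain numbers are all hypothetical and exist only so the example runs.

```python
def select_features(base, candidates, cv_score, min_gain=0.002):
    """Greedy forward selection with a plain-language reason for each decision.

    cv_score is any callable mapping a feature list to a cross-validated
    score (higher is better). In a real pipeline it would train and score
    a model; here it is injected so the control flow stays testable.
    """
    kept = list(base)
    best = cv_score(kept)
    decisions = []
    for feature in candidates:
        score = cv_score(kept + [feature])
        gain = score - best
        if gain >= min_gain:
            kept.append(feature)
            best = score
            decisions.append((feature, "kept", f"cv score rose by {gain:.3f}"))
        else:
            decisions.append(
                (feature, "pruned", f"gain {gain:+.3f} is below the {min_gain} threshold")
            )
    return kept, decisions


# Toy stand-in for a cross-validated model score. The gains are invented
# for illustration and do not come from any real experiment.
TOY_GAINS = {
    "tenure_x_monthly_charges": 0.015,
    "days_since_last_login": 0.008,
    "customer_id_hash": 0.0,
}


def toy_cv_score(features):
    return 0.80 + sum(TOY_GAINS.get(f, 0.0) for f in features)


kept, decisions = select_features(
    base=["tenure", "monthly_charges"],
    candidates=list(TOY_GAINS),
    cv_score=toy_cv_score,
)
for feature, verdict, reason in decisions:
    print(f"{feature}: {verdict} ({reason})")
```

because the scorer is injected, the same loop could wrap a cross-validated lightgbm model, and the importance-based pruning described above could be added as a second pass over `kept`.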

why it matters: automating routine data science tasks with agents frees practitioners to focus on high-value decisions, speeding up pipelines and reducing repetitive work.
