optimizing latency, reliability, and cost in llm agent workflows

source: arxiv artificial intelligence: toward reliable design of llm-enabled agentic workflows: optimizing latency-reliability-cost tradeoffs

level: research

modern ai systems often chain together multiple agents, some using large language models and others running standard code. this paper looks at the core tradeoffs between how fast a workflow runs, how reliable its outputs are, and how much it costs. the authors build performance models for both llm and non-llm agents. for llm agents, they use a parametric exponential function that links reliability to the number of reasoning and output tokens used.
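
as a rough illustration of what such a curve can look like, here is a minimal python sketch. the summary does not give the paper's exact functional form, so the saturating shape below and the names `llm_reliability`, `r_min`, `r_max`, and `rate` are assumptions made for illustration only.

```python
import math

def llm_reliability(tokens, r_min=0.5, r_max=0.99, rate=0.004):
    """Hypothetical saturating-exponential reliability curve (an assumption,
    not the paper's stated model).

    tokens : reasoning + output tokens granted to the agent
    r_min  : reliability with zero tokens
    r_max  : ceiling approached as tokens grow
    rate   : how quickly extra tokens close the gap to r_max
    """
    return r_max - (r_max - r_min) * math.exp(-rate * tokens)

# Each extra token helps less than the previous one (diminishing returns).
for n in (0, 250, 1000, 4000):
    print(n, round(llm_reliability(n), 4))
```

the point of the shape is diminishing returns: the first few hundred tokens buy far more reliability than the last few hundred, which is what makes the allocation question non-trivial.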

the work focuses on sequential workflows, where each task passes from one agent to the next. under fixed limits on latency and cost, the authors derive the optimal distribution of tokens across agents. the main result is a water-filling policy: tokens go first to the agents that yield the largest reliability gain per additional token, much as water poured into an uneven vessel fills the lowest points first and then rises to a common level.
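
the water-filling idea can be sketched in a few lines of python. this is not the paper's algorithm: it assumes each agent's reliability is `1 - a*exp(-b*n)`, that workflow reliability is the product of agent reliabilities, and that a single budget caps the token spend `sum(c*n)`. the names `tokens_at_price`, `water_fill`, and `EXAMPLE_AGENTS` are hypothetical.

```python
import math

def tokens_at_price(a, b, c, lam):
    """Tokens for one agent when one unit of budget is 'priced' at lam.

    Solves d/dn log(1 - a*exp(-b*n)) = lam * c for n, clipped at zero
    (agents whose marginal gain is below the price get nothing).
    """
    n = math.log(a * (b + lam * c) / (lam * c)) / b
    return max(0.0, n)

def water_fill(agents, budget, iters=200):
    """agents: list of (a, b, c) with 0 < a < 1, b > 0, per-token cost c > 0.
    Returns (token allocation, price lam) with sum(c * n) <= budget.
    Assumes the budget is binding within the searched price range.
    """
    lo, hi = 1e-12, 1e6            # search range for the price
    for _ in range(iters):
        lam = math.sqrt(lo * hi)   # bisection on a log scale
        spend = sum(c * tokens_at_price(a, b, c, lam) for a, b, c in agents)
        if spend > budget:
            lo = lam               # too much spend: raise the price
        else:
            hi = lam               # feasible: try a lower price
    return [tokens_at_price(a, b, c, hi) for a, b, c in agents], hi

def workflow_reliability(agents, alloc):
    """Product of per-agent reliabilities (independence is assumed)."""
    return math.prod(1 - a * math.exp(-b * n) for (a, b, _), n in zip(agents, alloc))

# Three made-up agents: (a, b, cost per token).
EXAMPLE_AGENTS = [(0.5, 0.004, 1.0), (0.3, 0.002, 1.0), (0.1, 0.01, 2.0)]
alloc, lam = water_fill(EXAMPLE_AGENTS, budget=2000.0)
print([round(n) for n in alloc], lam, workflow_reliability(EXAMPLE_AGENTS, alloc))
```

at the solution, every agent that receives tokens has the same marginal gain in log-reliability per unit of budget. that common level is the "water level" of the analogy.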

the analysis also expresses the best possible workflow reliability in terms of shadow prices, which measure how much a slight relaxation of latency or cost constraints would improve reliability. this gives a principled way to decide where to invest extra resources. the findings apply to any system that combines llm calls with deterministic processing steps, from customer support bots to data analysis pipelines.
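
to make the role of shadow prices concrete, the block below writes out a generic constrained problem and its lagrangian. it is a sketch under stated assumptions, not the paper's exact formulation: workflow reliability is taken as the product of per-agent reliabilities r_i(n_i), and latency and cost are taken as linear in tokens with hypothetical per-token coefficients \tau_i and c_i. T and C are the latency and cost limits.

```latex
\begin{aligned}
\max_{n_1,\dots,n_K \ge 0}\quad & \log R = \sum_{i=1}^{K} \log r_i(n_i) \\
\text{s.t.}\quad & \sum_{i=1}^{K} \tau_i n_i \le T, \qquad \sum_{i=1}^{K} c_i n_i \le C \\[4pt]
\mathcal{L} &= \sum_{i=1}^{K} \log r_i(n_i)
  + \lambda \Bigl( T - \sum_{i=1}^{K} \tau_i n_i \Bigr)
  + \mu \Bigl( C - \sum_{i=1}^{K} c_i n_i \Bigr) \\[4pt]
\frac{\partial \log R^\star}{\partial T} &= \lambda^\star, \qquad
\frac{\partial \log R^\star}{\partial C} = \mu^\star
\end{aligned}
```

the last line is the standard envelope result: the multipliers \lambda^\star and \mu^\star are the shadow prices, so a larger \lambda^\star than \mu^\star (in comparable units) suggests that loosening the latency limit would help reliability more than loosening the cost limit.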

why it matters: it gives ai engineers a clear method to balance speed, accuracy, and spending when building multi-step llm applications.
