teamtr fine-tunes multi-agent llms with trust regions

source: arxiv machine learning: teamtr: trust-region fine-tuning for multi-agent llm coordination

level: research

multi-agent llm systems often do worse than single models, and the authors attribute this to a problem they call compounding occupancy shift. when agents share context and are fine-tuned one after another, updating one agent changes the context distribution the next agent sees. if later agents are then updated on rollouts cached before that change, they train on a stale distribution, and the mismatch grows with each agent in the sequence. the authors prove that this penalty scales quadratically with the number of agents when rollouts are stale, but only linearly when rollouts are refreshed after each update.
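the quadratic-versus-linear gap can be illustrated with a toy accounting model. the sketch below assumes every agent update shifts the shared context distribution by the same amount `delta` and that these shifts simply add up; both are simplifying assumptions made for illustration, not the paper's actual bound, and the function names are hypothetical.

```python
def stale_penalty(n_agents: int, delta: float) -> float:
    """Toy total mismatch when all agents reuse rollouts cached before any update.

    Assumption: agent k (0-indexed) is updated on data that ignores the k
    earlier updates and also pays for its own shift, so its mismatch is
    (k + 1) * delta. Summing gives delta * n * (n + 1) / 2, i.e. quadratic in n.
    """
    return sum((k + 1) * delta for k in range(n_agents))


def refreshed_penalty(n_agents: int, delta: float) -> float:
    """Toy total mismatch when rollouts are resampled after every update.

    Assumption: fresh rollouts already reflect all earlier updates, so each
    agent only pays for its own shift. The total is n * delta, i.e. linear in n.
    """
    return sum(delta for _ in range(n_agents))


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, stale_penalty(n, 1.0), refreshed_penalty(n, 1.0))
```

with four agents and `delta = 1.0`, the stale total is 10.0 while the refreshed total is 4.0, and the ratio between the two keeps growing as agents are added.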

teamtr fixes this by resampling trajectories after every agent update and by limiting how much each agent can change in a single update. this trust-region approach gives guaranteed improvement at each step and each stage. because the method controls per-agent divergence, the team stays coordinated as its members are updated in turn. in experiments, teamtr beats single-agent and standard sequential fine-tuning baselines by 7.1% on average across several reasoning tasks.
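the two ingredients described above, fresh rollouts before each agent's update and a cap on per-agent divergence, can be sketched on a toy problem. this is a minimal illustration under stated assumptions, not the authors' algorithm: agents are small categorical policies rather than llms, the update rule (exponentiated advantage reweighting), the kl-based backtracking, and all function names are hypothetical choices made here.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for two categorical distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sample_rollouts(policies, reward_fn, n, rng):
    """Sample joint actions from the CURRENT policies of all agents."""
    rollouts = []
    for _ in range(n):
        actions = [rng.choices(range(len(p)), weights=p)[0] for p in policies]
        rollouts.append((actions, reward_fn(actions)))
    return rollouts


def proposed_policy(old, rollouts, agent, lr=1.0):
    """Toy update (an assumption): reweight actions by exp(lr * advantage)."""
    baseline = sum(r for _, r in rollouts) / len(rollouts)
    adv = [0.0] * len(old)
    count = [0] * len(old)
    for actions, r in rollouts:
        adv[actions[agent]] += r - baseline
        count[actions[agent]] += 1
    adv = [a / c if c else 0.0 for a, c in zip(adv, count)]
    weights = [p * math.exp(lr * a) for p, a in zip(old, adv)]
    z = sum(weights)
    return [w / z for w in weights]


def trust_region_step(old, target, max_kl):
    """Halve the step toward `target` until KL(new || old) <= max_kl."""
    alpha = 1.0
    while alpha > 1e-6:
        new = [(1 - alpha) * o + alpha * t for o, t in zip(old, target)]
        if kl(new, old) <= max_kl:
            return new
        alpha *= 0.5
    return list(old)  # no acceptable step found: keep the old policy


def sequential_sweep(policies, reward_fn, max_kl=0.01, n=512, seed=0):
    """Update agents one at a time, resampling rollouts before each update."""
    rng = random.Random(seed)
    policies = [list(p) for p in policies]
    for agent in range(len(policies)):
        # Refresh: rollouts reflect every update made earlier in this sweep.
        rollouts = sample_rollouts(policies, reward_fn, n, rng)
        target = proposed_policy(policies[agent], rollouts, agent)
        policies[agent] = trust_region_step(policies[agent], target, max_kl)
    return policies


if __name__ == "__main__":
    start = [[0.5, 0.5] for _ in range(3)]
    team_reward = lambda actions: float(all(a == 1 for a in actions))
    print(sequential_sweep(start, team_reward))
```

the stale variant that the summary warns against would call `sample_rollouts` once, before the loop, and reuse that batch for every agent. moving the call inside the loop is the only structural difference in this sketch.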

the work also provides a theoretical account of why naive sequential fine-tuning fails in multi-agent setups. by formalizing compounding occupancy shift, it shows that updating agents in order without refreshing rollouts leads to a breakdown in coordination, and that teamtr's resampling and trust-region constraints target exactly this failure. the results suggest that careful management of the context distribution is key to getting multi-agent llm systems to outperform single models.

why it matters: it gives a principled way to fine-tune multi-agent llm teams so that they actually outperform single models on complex reasoning tasks.
