level: technical
large language models acting as agents often need to pick the right tool from a large catalog. embedding-based retrieval can miss specialized tool details. parametric retrieval instead adds each tool to the model's vocabulary as a virtual token and fine-tunes the model to generate the matching token for a query. this method scores well on standard benchmarks like toolbench, but those tests use detailed queries and constrained decoding that limits outputs to valid tool tokens, both of which make the task easier. they do not show whether the model really grasps what each tool does.
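to illustrate why constrained decoding can flatter a model, here is a minimal sketch in plain python. the vocabulary, the virtual tool token names, and the logits are all hypothetical toy values, not taken from toolbench or any real model. the sketch shows only the general masking idea: scores for non-tool tokens are discarded before the best token is picked.

```python
import math

def constrained_argmax(logits, allowed_ids):
    """Return the highest-scoring token id, considering only allowed_ids."""
    masked = [
        score if idx in allowed_ids else -math.inf
        for idx, score in enumerate(logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)

# Hypothetical toy vocabulary: two ordinary tokens, then two virtual tool tokens.
vocab = ["the", "weather", "<tool_get_forecast>", "<tool_convert_currency>"]
tool_ids = {2, 3}

# Hypothetical next-token scores: the model actually prefers an ordinary word.
logits = [2.0, 3.5, 1.0, 0.2]

unconstrained = max(range(len(logits)), key=logits.__getitem__)
constrained = constrained_argmax(logits, tool_ids)

print(vocab[unconstrained])  # weather
print(vocab[constrained])    # <tool_get_forecast>
```

in this toy case the model's free choice is not a tool at all, yet the constrained pick is a valid tool token. a benchmark that only scores the constrained output would not see that difference.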
toolsense is an open-source diagnostic framework that builds three new benchmarks from any tool catalog. the realistic retrieval benchmark (rrb) uses queries at three levels of detail: vague, moderate, and fully specified. the tool understanding benchmark (tub) checks if the model can answer questions about tool functions, inputs, and outputs. the adversarial robustness benchmark (arb) tests the model against tricky queries designed to fool it. these tests reveal gaps that standard benchmarks miss.
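the three-level query design of the rrb could be scored along the following lines. this is a sketch under stated assumptions, not toolsense's actual api: the `RetrievalCase` type, the `accuracy_by_level` helper, the level labels, the example queries, and the keyword-based stand-in retriever are all hypothetical and exist only to show per-level accuracy reporting.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalCase:
    query: str
    level: str      # assumed labels: "vague", "moderate", "full"
    gold_tool: str  # name of the tool the query should retrieve

def accuracy_by_level(cases, retrieve):
    """Compute top-1 accuracy separately for each level of query detail."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in cases:
        totals[case.level] += 1
        if retrieve(case.query) == case.gold_tool:
            hits[case.level] += 1
    return {level: hits[level] / totals[level] for level in totals}

def keyword_retriever(query):
    # Stand-in for a real model: succeeds only when the tool's keyword is explicit.
    return "get_forecast" if "forecast" in query else "unknown"

# Hypothetical cases for one tool, written at three levels of detail.
cases = [
    RetrievalCase("should i bring an umbrella?", "vague", "get_forecast"),
    RetrievalCase("what's the weather like in paris tomorrow?", "moderate", "get_forecast"),
    RetrievalCase("get the 3-day forecast for paris in celsius", "full", "get_forecast"),
]

scores = accuracy_by_level(cases, keyword_retriever)
print(scores)  # {'vague': 0.0, 'moderate': 0.0, 'full': 1.0}
```

reporting accuracy per level rather than as one pooled number keeps a strong score on fully specified queries from hiding failures on vague ones.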
early results show that models fine-tuned for parametric retrieval often fail on toolsense benchmarks. they struggle with vague queries, misunderstand tool capabilities, and are easily confused by adversarial inputs. this suggests that high scores on existing tests come from surface pattern matching, not real comprehension. toolsense provides a more honest measure of tool knowledge, helping developers build more reliable ai agents that can handle real-world, messy user requests.
why it matters: toolsense helps ai developers spot when a model only appears to know its tools, which supports safer and more reliable agent behavior.