level: research
in federated language modeling, multiple nodes hold private data and cannot share full-precision gradients or weights. instead, each node may upload at most a fixed number of bits per query on a public probe set. the goal is to estimate a conditional distribution over a vocabulary of tokens. in federated probe-logit distillation (fpld), each node sends a quantized logit vector for each probe query, and a central aggregator distills these vectors into a global student model. prior work gave an upper bound on the kl divergence rate, but it left two questions open: whether the bandwidth term was tight, and how to handle nodes with different upload limits.
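the upload-and-aggregate step described above can be sketched as follows. this is a minimal illustration, not the paper's construction: the even split of the bit budget across vocabulary entries, the clipping range `lo`/`hi`, the plain averaging rule, and all function names are assumptions made for the example. the student-training step that would consume the aggregated distribution is omitted.

```python
import math
import random


def quantize_logits(logits, bits_per_query, lo=-8.0, hi=8.0):
    """Uniformly quantize a logit vector under a total bit budget.

    Assumption: the budget is split evenly, so each of the V entries
    gets bits_per_query // V bits. lo/hi are an assumed clipping range.
    """
    vocab_size = len(logits)
    bits_per_entry = bits_per_query // vocab_size
    if bits_per_entry < 1:
        raise ValueError("budget too small: need at least 1 bit per entry")
    levels = 2 ** bits_per_entry
    step = (hi - lo) / (levels - 1)
    codes = []
    for z in logits:
        z = min(max(z, lo), hi)                    # clip into the assumed range
        codes.append(int(round((z - lo) / step)))  # integer in [0, levels - 1]
    return codes, step


def dequantize(codes, step, lo=-8.0):
    """Map integer codes back to logit values (aggregator side)."""
    return [lo + c * step for c in codes]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(node_logits):
    """Average decoded logits across nodes, then normalize to a distribution."""
    n = len(node_logits)
    vocab_size = len(node_logits[0])
    mean = [sum(v[j] for v in node_logits) / n for j in range(vocab_size)]
    return softmax(mean)


# usage: 5 nodes, vocabulary of 4 tokens, 32 bits per query (8 bits per entry)
random.seed(0)
true_logits = [2.0, 0.5, -1.0, -3.0]
decoded = []
for _ in range(5):
    noisy = [z + random.gauss(0.0, 0.3) for z in true_logits]  # node-local estimate
    codes, step = quantize_logits(noisy, bits_per_query=32)
    decoded.append(dequantize(codes, step))
teacher = aggregate(decoded)  # soft targets a student model could be trained on
print(teacher)
```

with this even split, a larger vocabulary leaves fewer bits per entry for the same budget, which is why the bit budget and the vocabulary size enter the rate together.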
this paper closes both gaps. first, it proves a matching lower bound for the bandwidth term using a dithered fpld construction. writing n for the number of nodes, b for the per-query bit budget, and V for the vocabulary size, the lower bound is $\Omega\left(\frac{1}{n} \cdot 2^{-2b/V}\right)$, so the previous upper bound cannot be improved in its dependence on bandwidth. second, the analysis generalizes the upper bound to heterogeneous per-node bandwidth budgets. the new bound accounts for nodes with varying upload capacities, which makes fpld applicable to real-world settings where devices have different network conditions.
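a small numerical sketch can make the shape of the bandwidth term plausible. it is not the paper's proof or its dithered construction: it uses a textbook subtractive dithered uniform quantizer on a single logit entry, measures squared error rather than kl divergence, and averages nodes with equal weights. the names `dithered_quantize`, `predicted_mse`, and `simulated_mse`, the range `lo`/`hi`, and the per-entry bit widths are all assumptions for illustration. here n is the number of nodes, b the per-query bit budget, and V the vocabulary size, so each entry gets about b/V bits.

```python
import math
import random


def dithered_quantize(x, bits, rng, lo=-8.0, hi=8.0):
    """Subtractive dithered uniform quantizer for one scalar in [lo, hi].

    Assumption: the dither d is shared between node and aggregator (for
    example via a common seed), so only the integer code is uploaded.
    Returns the reconstructed value and the quantization step.
    """
    step = (hi - lo) / (2 ** bits - 1)
    d = rng.uniform(-step / 2, step / 2)
    code = math.floor((x - lo + d) / step + 0.5)  # round to nearest level
    code = min(max(code, 0), 2 ** bits - 1)       # guard against edge rounding
    return lo + code * step - d, step


def predicted_mse(bits_per_entry, lo=-8.0, hi=8.0):
    """Per-entry MSE of the plain average over nodes with given bit widths.

    With subtractive dither each node's error is uniform with variance
    step_i**2 / 12 and independent across nodes, so the average has
    variance sum_i step_i**2 / (12 * n**2).
    """
    n = len(bits_per_entry)
    steps = [(hi - lo) / (2 ** b - 1) for b in bits_per_entry]
    return sum(s * s for s in steps) / (12.0 * n * n)


def simulated_mse(x, bits_per_entry, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity for a fixed input x."""
    rng = random.Random(seed)
    n = len(bits_per_entry)
    total = 0.0
    for _ in range(trials):
        est = sum(dithered_quantize(x, b, rng)[0] for b in bits_per_entry) / n
        total += (est - x) ** 2
    return total / trials


# homogeneous budgets: doubling n halves the error, one extra bit per entry
# roughly quarters it, matching the (1/n) * 2^(-2b/V) form
print(predicted_mse([6] * 4), predicted_mse([6] * 8), predicted_mse([7] * 4))

# heterogeneous budgets: a mix of low- and high-bandwidth nodes
mixed = [4, 6, 8]
print(predicted_mse(mixed), simulated_mse(1.234, mixed))
```

in the heterogeneous case the equal-weight average is dominated by the lowest-bandwidth node, which suggests why a bound for mixed budgets has to track each node's budget separately rather than a single shared b.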
the results show that fpld is minimax optimal in its bandwidth dependence under communication constraints: because the lower bound matches the bandwidth term of the upper bound, no method can improve on that term in the worst case. the heterogeneous extension lets federated learning systems combine high- and low-bandwidth clients while keeping the theoretical guarantees. this matters for deploying language models on edge devices such as phones or iot sensors, where upload speeds vary widely. together, the two results give communication-efficient federated distillation a tight theoretical foundation.
why it matters: it gives tight theoretical guarantees for communication-efficient federated learning, enabling reliable model training across devices with different upload speeds.