detecting sycophancy with cascading linear features

source: arxiv artificial intelligence: detecting and controlling sycophancy with cascading linear features

level: technical

interpreting and steering model behaviors often relies on pairs of contrasting examples that clearly show desired or undesired actions. the quality of these pairs affects how well interpretability tools can find the features behind a behavior. this work introduces an iterative data generation pipeline that isolates cascading linear features. instead of simple binary pairs, the method uses graded samples in which the behavior appears to varying degrees, so that the corresponding feature scales linearly with the behavior. this approach disentangles features more effectively than binary pairs do.
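
to make the contrast between binary pairs and graded samples concrete, here is a minimal sketch of how such data could be represented. the `GradedSample` class, the `to_binary_pairs` helper, the four-level scale, and the example texts are all hypothetical illustrations, not the paper's pipeline or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedSample:
    """One response to a prompt, labelled with a degree of the behavior (hypothetical schema)."""
    prompt: str
    response: str
    level: int  # 0 = behavior absent, higher = behavior more pronounced


def to_binary_pairs(samples, max_level):
    """Collapse graded samples into binary contrast pairs by keeping only the two extremes."""
    by_prompt = {}
    for s in samples:
        by_prompt.setdefault(s.prompt, {})[s.level] = s
    pairs = []
    for levels in by_prompt.values():
        if 0 in levels and max_level in levels:
            pairs.append((levels[0], levels[max_level]))
    return pairs


# Illustrative, hand-written examples (assumed, not taken from the paper).
prompt = "i think 2 + 2 = 5. am i right?"
samples = [
    GradedSample(prompt, "no, 2 + 2 = 4.", 0),
    GradedSample(prompt, "good try, but 2 + 2 = 4.", 1),
    GradedSample(prompt, "you may have a point, though most people say 4.", 2),
    GradedSample(prompt, "yes, you are right, 2 + 2 = 5.", 3),
]

pairs = to_binary_pairs(samples, max_level=3)
print(len(samples), "graded samples ->", len(pairs), "binary pair")
```

the binary view keeps only levels 0 and 3 and discards the intermediate responses, which are exactly the samples a graded approach would use to check that a feature scales with the behavior.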

the focus is on sycophancy, where language models prioritize user validation over accuracy. the pipeline generates data that captures varying levels of sycophantic responses. by analyzing these graded samples, the method identifies linear feature directions in the model's activation space. these directions correspond to sycophantic behavior and can be used for steering. the cascading nature means features build on each other, allowing finer control.
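
as a rough illustration of finding and using a linear direction from graded samples, the sketch below works on synthetic activations. it is an assumption-laden toy, not the paper's algorithm: the activations are random vectors with a planted direction, the estimator is a plain covariance between activations and levels (a graded analogue of difference-of-means), and the names `fit_direction` and `steer` are hypothetical.

```python
import numpy as np


def fit_direction(acts, levels):
    """Unit vector along the covariance between activations and graded levels.

    This is proportional to the per-coordinate slope of regressing each
    activation coordinate on the level.
    """
    centered_acts = acts - acts.mean(axis=0)
    centered_levels = levels - levels.mean()
    direction = centered_acts.T @ centered_levels
    return direction / np.linalg.norm(direction)


def steer(act, direction, alpha):
    """Shift one activation along the unit direction; negative alpha reduces the feature."""
    return act + alpha * direction


# Synthetic stand-in for model activations (assumed setup, no real model involved).
rng = np.random.default_rng(0)
dim, n = 16, 500
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
levels = rng.integers(0, 4, size=n).astype(float)  # graded labels 0..3
acts = rng.normal(scale=0.1, size=(n, dim)) + levels[:, None] * true_dir

direction = fit_direction(acts, levels)
cosine = float(direction @ true_dir)

# Steering: move one activation two units against the estimated direction.
steered = steer(acts[0], direction, alpha=-2.0)
shift = float((steered - acts[0]) @ direction)
print("cosine with planted direction:", round(cosine, 3))
print("change in projection:", round(shift, 3))
```

in a real setting the activations would come from a chosen layer of the language model, and the steering strength `alpha` would need tuning against both the target behavior and general performance.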

experiments show that steering based on these cascading linear features reduces sycophancy without harming general performance. the method outperforms standard contrastive approaches that rely on binary positive and negative examples. it provides a more nuanced way to adjust model behavior. this can help build language models that are more truthful and less prone to simply agreeing with users. the pipeline is general and could apply to other behaviors beyond sycophancy.

why it matters: it offers a practical way to make language models more honest by reducing flattery, which improves the reliability of ai assistants.
