source: arxiv artificial intelligence: fair outputs, biased internals: causal potency and asymmetry of latent bias in llms for high-stakes decisions

level: research

instruction-tuned language models used for mortgage underwriting produce fair decisions on the surface, but their internal workings tell a different story. researchers tested open-weight models with matched loan applications that differed only in racially-associated names. the models showed no output-level bias, yet their hidden layers amplified demographic information. this means the models learned and strengthened biased associations even though they did not act on them in final predictions.
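the matched-application test described above can be sketched as a paired counterfactual audit. this is a minimal illustration, not the study's actual harness: `decide` is a hypothetical callable wrapping whatever model is under test, and the templates and placeholder names are assumptions made for the example.

```python
from typing import Callable, Sequence


def paired_approval_gap(
    decide: Callable[[str], bool],
    templates: Sequence[str],
    names_a: Sequence[str],
    names_b: Sequence[str],
) -> float:
    """Approval-rate difference (group a minus group b) over matched pairs.

    Each template is filled once with a name from each group, so the two
    applications in a pair differ only in the name. `decide` is a
    hypothetical stand-in for the model call (True means approve).
    """
    approvals_a = 0
    approvals_b = 0
    pairs = 0
    for template in templates:
        for name_a, name_b in zip(names_a, names_b):
            approvals_a += bool(decide(template.format(name=name_a)))
            approvals_b += bool(decide(template.format(name=name_b)))
            pairs += 1
    if pairs == 0:
        raise ValueError("need at least one template and one name pair")
    return (approvals_a - approvals_b) / pairs


if __name__ == "__main__":
    # Placeholder template and names, invented for illustration only.
    templates = ["applicant: {name}. income: 85000. credit score: 720."]

    def stub_decide(text: str) -> bool:
        # Stub that looks only at the credit score, never at the name.
        return "credit score: 720" in text

    print(paired_approval_gap(stub_decide, templates, ["name_a"], ["name_b"]))
```

a gap near zero corresponds to the output-level fairness the study reports. it says nothing about what the hidden layers encode, which is the point of the internal analysis.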

using activation steering and cross-layer interventions, the study found that suppressed demographic information is still decision-relevant. when researchers reinjected this latent bias at critical layers, it caused near-complete decision reversals. the effect was asymmetric: steering interventions changed outcomes for one demographic group but not the other. this shows that internal representations are not neutral and can be weaponized or accidentally triggered, undermining the fairness of the system.
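the steering idea can be illustrated with a toy example. this sketch assumes a difference-of-means steering direction and a linear readout over hand-written three-dimensional activations; these are illustrative choices, not the study's models, layers, or exact intervention. all vectors, weights, and function names here are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_direction(acts_a: Sequence[Vector], acts_b: Sequence[Vector]) -> Vector:
    """Steering direction: mean activation of group a minus that of group b."""
    return [a - b for a, b in zip(mean_vector(acts_a), mean_vector(acts_b))]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha times the direction to a hidden activation."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def approve(hidden: Vector, weights: Vector, bias: float) -> bool:
    """Toy linear readout: approve when the score is positive."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias > 0


def flip_rate(hiddens: Sequence[Vector], direction: Vector, alpha: float,
              weights: Vector, bias: float) -> float:
    """Fraction of decisions that change after steering."""
    flips = sum(
        approve(h, weights, bias) != approve(steer(h, direction, alpha), weights, bias)
        for h in hiddens
    )
    return flips / len(hiddens)


# Hypothetical activations: the third coordinate carries the group signal,
# too weak to change any decision on its own.
acts_a = [[0.8, 0.6, 0.2], [0.3, 0.3, 0.2]]
acts_b = [[0.8, 0.6, -0.2], [0.3, 0.3, -0.2]]
weights, bias = [1.0, 1.0, 0.5], -1.0

direction = mean_difference_direction(acts_a, acts_b)
baseline_a = [approve(h, weights, bias) for h in acts_a]
baseline_b = [approve(h, weights, bias) for h in acts_b]
print("matched decisions agree:", baseline_a == baseline_b)
print("flip rate, group a steered toward b:", flip_rate(acts_a, direction, -4.0, weights, bias))
print("flip rate, group b steered toward a:", flip_rate(acts_b, direction, 4.0, weights, bias))
```

comparing the two flip rates is one simple way to check for the kind of asymmetry the study describes. this toy readout is linear and symmetric, so it will not reproduce that asymmetry; it only shows how a signal that leaves outputs unchanged can flip decisions once amplified.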

the findings highlight a dangerous gap between behavioral fairness and internal model states. even if a model passes standard bias tests, its latent representations can still encode harmful stereotypes. this matters for high-stakes domains like lending, hiring, or criminal justice, where hidden biases could be exploited or emerge under distribution shifts. the research calls for new auditing methods that go beyond output analysis to inspect and control internal model representations.

why it matters: latent bias can cause unfair decisions if triggered, so ai systems need deeper auditing beyond output checks.

