level: research
Distinguishing real adverse drug events from coincidental associations is a hard problem in pharmacovigilance. The InferBERT framework combines transformers with do-calculus to find causal links, but its performance depends on the classification model at its core. This study tested four models on two benchmarks: analgesics-induced acute liver failure and tramadol-related mortalities. The models were XGBoost as a baseline, ALBERT from the original InferBERT, BioBERT (a biomedical transformer), and Med-LLaMA (a medical large language model).
The evaluation used 5-fold cross-validation repeated 20 times. Metrics included accuracy, expected calibration error (ECE) before and after isotonic regression, and Jaccard concordance of causal sets. XGBoost often matched or beat the transformer models in accuracy, suggesting that simpler models can work well. BioBERT improved calibration and causal concordance over ALBERT, showing the value of domain-specific pretraining. Med-LLaMA did not consistently outperform the smaller models, so scaling to large language models did not guarantee better causal detection.
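The two less familiar metrics above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: it assumes a binary task, ten equal-width probability bins, an ECE that compares the mean predicted positive-class probability in each bin with the observed positive rate, and a Jaccard concordance computed between two sets of feature names flagged as causal. The function names and the bin count are hypothetical.

```python
from typing import Iterable, Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for a binary classifier (assumed definition, see lead-in).

    probs  : predicted probability of the positive class, each in [0, 1]
    labels : observed outcomes, each 0 or 1
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")

    counts = [0] * n_bins
    prob_sums = [0.0] * n_bins
    label_sums = [0] * n_bins
    for p, y in zip(probs, labels):
        # Equal-width bins; p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
        prob_sums[idx] += p
        label_sums[idx] += y

    n = len(probs)
    ece = 0.0
    for count, prob_sum, label_sum in zip(counts, prob_sums, label_sums):
        if count:
            mean_confidence = prob_sum / count
            observed_rate = label_sum / count
            # Weight each bin's gap by the share of samples it holds.
            ece += (count / n) * abs(observed_rate - mean_confidence)
    return ece


def jaccard_concordance(set_a: Iterable[str], set_b: Iterable[str]) -> float:
    """Jaccard index |A & B| / |A | B| between two sets of causal factors."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets count as fully concordant.
        return 1.0
    return len(a & b) / len(union)
```

Under these assumptions, a concordance of 1.0 means two models (or two cross-validation runs) flagged exactly the same factors, and an ECE near 0 means predicted probabilities track observed frequencies closely.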
Post-hoc calibration with isotonic regression reduced expected calibration error across all models, making their probability estimates more reliable. The study highlights that model selection is critical for causal inference in drug safety. Practitioners should not assume that larger or more complex models are always better. Instead, they should compare several classifiers and apply calibration to improve the reliability of causal claims drawn from observational data.
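For readers unfamiliar with isotonic regression, the calibration step can be illustrated with a small pool-adjacent-violators (PAVA) sketch. It is an assumption-laden toy rather than the study's pipeline: the function names are hypothetical, the fitted map is a step function (library implementations such as scikit-learn's `IsotonicRegression` interpolate between points instead), and in practice the map would be fitted on held-out predictions rather than on the data used to train the classifier.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

IsotonicModel = Tuple[List[float], List[float]]


def fit_isotonic(scores: Sequence[float], labels: Sequence[int]) -> IsotonicModel:
    """Fit a non-decreasing step function from raw scores to calibrated probabilities."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")

    # Pool tied scores first so every distinct score maps to a single value.
    grouped = {}
    for s, y in zip(scores, labels):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, count + 1)

    # Each block holds [sum of labels, sample count, largest score covered].
    blocks: List[list] = []
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            last_total, last_count, last_upper = blocks.pop()
            blocks[-1][0] += last_total
            blocks[-1][1] += last_count
            blocks[-1][2] = last_upper

    uppers = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return uppers, values


def predict_isotonic(model: IsotonicModel, score: float) -> float:
    """Map a raw score to the value of the first block whose upper bound covers it."""
    uppers, values = model
    idx = bisect_left(uppers, score)
    # Scores above the largest fitted score reuse the last block's value.
    return values[min(idx, len(values) - 1)]


# Toy usage: the out-of-order labels at 0.2 and 0.3 are pooled into one block.
model = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
calibrated = [predict_isotonic(model, s) for s in (0.05, 0.25, 0.9)]
```

Because the fitted map is monotone, it preserves the classifier's ranking of cases up to ties while replacing raw scores with observed event frequencies, which is why it can lower ECE without materially changing rank-based performance.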
why it matters: Choosing the right model and calibrating it can make drug safety monitoring more accurate, helping to identify truly harmful drug effects faster.