source: arxiv statistics ml: a bayesian boolean matrix factorization with application to copy number analysis in cancer

level: research

binary data is common in many fields, but standard real-valued factorization methods ignore its discrete nature and produce factors that are hard to interpret. boolean matrix factorization (boomf) decomposes a binary matrix into two lower-rank binary matrices, combined with logical AND and OR in place of multiplication and addition: an entry of the reconstruction is 1 if at least one latent pattern is active in both the corresponding row and the corresponding column. this expresses the data as a boolean disjunction of patterns, which can be more meaningful than additive or rotational decompositions. in cancer genomics, boomf can uncover coordinated feature changes that may drive tumor evolution.
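
to make the boolean product concrete, here is a minimal python sketch. the function name `boolean_product` and the toy matrices `u` and `v` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def boolean_product(u, v):
    """boolean matrix product: x[i, j] = OR over k of (u[i, k] AND v[k, j])."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    # the integer product counts how many latent patterns cover each entry;
    # the boolean product only asks whether that count is non-zero
    return (u.astype(int) @ v.astype(int)) > 0

# hypothetical toy example: 4 samples, 2 latent patterns, 5 features
u = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [0, 0]])
v = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 0, 1]])

x = boolean_product(u, v).astype(int)
print(x)
```

the third sample carries both patterns, so its row is the element-wise OR of the two rows of `v`. an additive factorization would instead give a 2 where the patterns overlap.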

most existing boomf methods are heuristic and greedy. they are sensitive to initialization, often get stuck in local optima, and lack principled model selection or uncertainty quantification. to address these issues, the authors propose bayesian boolean matrix factorization (bbmf). bbmf is a fully conjugate generative model with sparsity-inducing priors. it enforces boolean constraints and yields interpretable latent factors with coherent uncertainty estimates. inference uses gibbs sampling with closed-form full conditionals, so each variable is resampled by a direct draw from a known distribution and no proposal tuning is needed.
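
the summary does not give the paper's priors or likelihood, so the following is only a sketch of how gibbs sampling with closed-form conditionals can work for a boolean factorization, not the authors' model. it assumes a simplified setup: independent bernoulli(`pi`) priors on every factor entry and a fixed bit-flip noise rate `eps`. all names (`gibbs_sweep`, `log_lik_rows`, `eps`, `pi`) and the simulated data are hypothetical.

```python
import numpy as np

def log_lik_rows(x, z, eps):
    """per-row log-likelihood under bit-flip noise: x equals z with probability 1 - eps."""
    return np.where(x == z, np.log1p(-eps), np.log(eps)).sum(axis=1)

def gibbs_sweep(x, u, v, eps, pi, rng):
    """resample every entry of u given v, one latent column at a time.

    rows of u are conditionally independent given v, so a whole column
    can be drawn at once from its bernoulli full conditional.
    """
    n, k_latent = u.shape
    prior_logit = np.log(pi) - np.log1p(-pi)
    for k in range(k_latent):
        others = np.delete(np.arange(k_latent), k)
        # reconstruction when pattern k is switched off / on for each row
        z_off = (u[:, others].astype(int) @ v[others].astype(int)) > 0
        z_on = z_off | v[k][None, :]
        logit = prior_logit + log_lik_rows(x, z_on, eps) - log_lik_rows(x, z_off, eps)
        p_on = 0.5 * (1.0 + np.tanh(0.5 * logit))  # numerically stable sigmoid
        u[:, k] = rng.random(n) < p_on
    return u

# simulated data (assumption): 40 samples, 30 features, 3 latent patterns
rng = np.random.default_rng(0)
n, m, k_latent = 40, 30, 3
eps, pi = 0.02, 0.3
u_true = rng.random((n, k_latent)) < pi
v_true = rng.random((k_latent, m)) < pi
x = ((u_true.astype(int) @ v_true.astype(int)) > 0) ^ (rng.random((n, m)) < eps)

# random initialization, then alternate sweeps over u and v
u = rng.random((n, k_latent)) < 0.5
v = rng.random((k_latent, m)) < 0.5
n_burn, n_keep = 100, 100
post_mean = np.zeros((n, m))
for it in range(n_burn + n_keep):
    u = gibbs_sweep(x, u, v, eps, pi, rng)
    # the model is symmetric in u and v, so reuse the sweep on the transpose
    v = gibbs_sweep(x.T, v.T.copy(), u.T, eps, pi, rng).T
    if it >= n_burn:
        post_mean += ((u.astype(int) @ v.astype(int)) > 0) / n_keep

print(post_mean.round(2)[:3, :8])
```

`post_mean[i, j]` estimates the posterior probability that entry (i, j) is covered by some pattern. values near 0 or 1 indicate confident entries, and values in between flag uncertain ones. a full treatment would also place priors on `eps` and `pi` rather than fixing them.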

the method is demonstrated on copy number data from cancer genomes. by modeling binary gain and loss events, bbmf identifies patterns of coordinated copy number changes across samples. these patterns can highlight potential driver events and tumor subtypes. the bayesian framework allows automatic determination of the number of factors and provides uncertainty measures for the discovered patterns. this helps researchers assess the reliability of the findings and compare alternative models.

why it matters: it gives data scientists a principled way to extract interpretable binary patterns from discrete data with uncertainty estimates, useful for genomics and other binary datasets.

