Separating Sparse Signals from Correlated Noise in Binary Classification


Among the goals of statistical genetics is to find sparse associations of genetic data with binary phenotypes, such as heritable diseases. Often, the data are obfuscated by confounders such as age, ancestry, or population structure. A widely appreciated modeling paradigm which corrects for such confounding relies on linear mixed models. These are linear regression models with correlated noise, where the noise covariance captures similarities between the samples. We generalize this modeling paradigm to binary classification. We thereby face the technical challenge that that marginalizing over the noise leads to an intractable, high-dimensional integral. We propose a variational EM algorithm to overcome this problem, where the global model parameters are $l_1$-norm regularized, leading to a sparse solution. The selected features are much less affected by the spurious correlations in the data, manifested by a smaller correlation between the features and the first principal component of the noise covariance. The proposed method also outperforms Gaussian process classification and uncorrelated probit regression in terms of prediction performance. In addition, we discuss ongoing work on employing stochastic gradient MCMC for this problem class.

UAI 2016 Workshop on Causation: Foundation to Application
Oral Presenation

More Publications