Unsupervised methods for integrative data analysis

Date

2024-05-17

Abstract

Unsupervised data analysis methods are important in data exploration for introducing structure, reducing data dimensions, and extracting interpretable knowledge. Integrative analysis of two or more data sets is crucial for understanding local and global effects within and across data sources. Recent technological advances in large-scale collection of single-cell data require efficient and scalable methods that can process the growing volume of available data. Integration of data sources with secondary data, also known as side information, can improve prediction of missing data and is important for recommender systems. However, many existing methods cannot accommodate the scale or complexity of the available data. There is therefore a need for new methods for unsupervised integrative data analysis that scale well with input data size, can be applied efficiently, and provide flexible support for complex data input.

In Paper I, a novel scalable method is proposed that integrates gene clustering of single-cell data with selection of cluster-specific gene regulators that have sign-consistent correlation and therefore a well-defined effect within each cluster. An efficient alternating two-step algorithm for parameter estimation is developed, along with criteria for selecting optimal hyperparameters and the number of clusters. Applications to single-cell data demonstrate the method's capability to identify regulators of intratumoral heterogeneity, primarily in neural cancers.

In Paper II, a low-rank matrix factorization model is proposed that allows flexible integration of input data sources and produces interpretable estimates of orthogonal latent factors. Parameter estimation is performed efficiently within an ADMM framework, and the framework's convergence theory is extended to support embedded manifold constraints such as orthogonality. Simulation studies show that the method performs well in comparison to established methods, and they demonstrate the importance of supporting flexible data input layouts.

Paper III addresses the lack of scalable, flexible matrix integration methods by reformulating the data integration problem as a graph estimation problem. A novel algorithm is proposed that uses matrix denoising and the asymptotic geometry of singular vectors in noise-perturbed low-rank matrices to perform estimation within the graphical framework. Simulation studies demonstrate the method's high scalability in comparison to established methods.

Software packages with easy-to-use interfaces for each paper are publicly available. The methods presented in this thesis contribute to the development of efficient, flexible, and scalable unsupervised methods for integrative data analysis.
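As a rough illustration of what matrix denoising of a noise-perturbed low-rank matrix can look like, the sketch below applies a plain truncated SVD to a simulated matrix. This is a generic textbook technique, not the algorithm of Paper III; the function name `truncated_svd_denoise`, the matrix sizes, the rank, and the noise level are all assumptions chosen for the example, and the rank is treated as known.

```python
import numpy as np


def truncated_svd_denoise(Y, rank):
    """Return the best rank-`rank` approximation of Y (hypothetical helper).

    Keeps the leading `rank` singular triplets and discards the rest,
    which are assumed here to be dominated by noise.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


# Simulated data: a rank-3 signal plus i.i.d. Gaussian noise (assumed setup).
rng = np.random.default_rng(0)
m, n, r = 200, 100, 3
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # low-rank signal
Y = X + rng.standard_normal((m, n))                            # noisy observation

X_hat = truncated_svd_denoise(Y, r)

# Frobenius-norm errors of the raw observation and the denoised estimate.
err_noisy = np.linalg.norm(Y - X)
err_denoised = np.linalg.norm(X_hat - X)
print(f"error before denoising: {err_noisy:.2f}")
print(f"error after denoising:  {err_denoised:.2f}")
```

In this simulated setting the signal's singular values lie well above the noise level, so keeping only the leading singular triplets removes most of the noise. In practice the rank is usually unknown and has to be estimated.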

Keywords

clustering of regression models, low-rank matrix factorization, penalized optimization, ADMM with multi-affine constraints, orthogonality constraints, flexible data layouts, graph structure estimation, scalability
