Unsupervised methods for integrative data analysis

dc.contributor.author: Held, Felix
dc.date.accessioned: 2024-05-17T07:10:07Z
dc.date.available: 2024-05-17T07:10:07Z
dc.date.issued: 2024-05-17
dc.description.abstract: Unsupervised data analysis methods are important for data exploration to introduce structure, reduce data dimensions, or extract interpretable knowledge. Integrative analysis of two or more data sets is crucial to gain understanding of local and global effects within and across data sources. Recent technological advancements in large-scale collection of single-cell data require efficient and scalable methods to process the increasing size of available data. Integration of data sources with secondary data, also known as side information, can improve prediction of missing data and is important for recommender systems. However, many existing methods cannot accommodate the scale or complexity of available data. Therefore, there is a need for new methods for unsupervised integrative data analysis that scale well with input data size, can be applied efficiently, and provide flexible support for complex data input. In Paper I, a novel scalable method is proposed which integrates gene clustering of single-cell data with selection of cluster-specific gene regulators having sign-consistent correlation and therefore a well-defined effect within each cluster. An efficient alternating two-step algorithm for parameter estimation is developed, along with criteria for optimal hyperparameter and cluster count selection. Applications to single-cell data demonstrate the method's capability to identify regulators of intratumoral heterogeneity, primarily in neural cancers. In Paper II, a low-rank matrix factorization model is proposed which allows flexible integration of input data sources and produces interpretable estimates of orthogonal latent factors. Parameter estimation is performed efficiently within an ADMM framework, and its convergence theory is extended to support embedded manifold constraints such as orthogonality. Simulation studies show that the method performs well in comparison to established methods and demonstrate the importance of support for flexible data input layouts. The lack of scalable, flexible matrix integration methods is addressed in Paper III by reformulating the data integration problem as a graph estimation problem. A novel algorithm is proposed, using matrix denoising and the asymptotic geometry of singular vectors in noise-perturbed low-rank matrices, to perform estimation within the graphical framework. Simulation studies demonstrate the method's high scalability in comparison to established methods. Software packages with easy-to-use interfaces for each paper are publicly available. The methods presented in this thesis contribute to the development of efficient, flexible, and scalable unsupervised methods for integrative data analysis. [sv]
dc.gup.defencedate: 2024-06-07
dc.gup.defenceplace: Friday, 7 June 2024, 13:15, Hörsal Pascal, Matematiska Vetenskaper, Hörsalsvägen 1 [sv]
dc.gup.department: Department of Mathematical Sciences ; Institutionen för matematiska vetenskaper [sv]
dc.gup.dissdb-fakultet: MNF
dc.gup.mail: felix.held@gu.se [sv]
dc.gup.origin: University of Gothenburg. Faculty of Science. [sv]
dc.identifier.isbn: 978-91-8069-599-2 (print)
dc.identifier.isbn: 978-91-8069-600-5 (PDF)
dc.identifier.uri: https://hdl.handle.net/2077/79445
dc.language.iso: eng [sv]
dc.relation.haspart: Paper I: Larsson I, Held F, Popova G, Koc A, Kundu S, Jörnsten R, Nelander S. Reconstructing regulatory programs underlying intratumoral heterogeneity and plasticity of cancer using scregclust. https://doi.org/10.1101/2023.03.10.532041 [sv]
dc.relation.haspart: Paper II: Held F, Lindbäck J, Jörnsten R. Sparse and Orthogonal Low-rank Collective Matrix Factorization (solrCMF): Efficient data integration in flexible layouts. https://doi.org/10.48550/arXiv.2405.10067 [sv]
dc.relation.haspart: Paper III: Held F. Large-scale Data Integration using Matrix Denoising and Geometric Factor Matching. https://doi.org/10.48550/arXiv.2405.10036 [sv]
dc.subject: clustering of regression models [sv]
dc.subject: low-rank matrix factorization [sv]
dc.subject: penalized optimization [sv]
dc.subject: ADMM with multi-affine constraints [sv]
dc.subject: orthogonality constraints [sv]
dc.subject: flexible data layouts [sv]
dc.subject: graph structure estimation [sv]
dc.subject: scalability [sv]
dc.title: Unsupervised methods for integrative data analysis [sv]
dc.type: Text
dc.type.degree: Doctor of Philosophy [sv]
dc.type.svep: Doctoral thesis [eng]

Files

Original bundle

Name: Omslag Felix Held.pdf
Size: 560.17 KB
Format: Adobe Portable Document Format
Description: Cover

Name: Kappa Felix Held.pdf
Size: 1.24 MB
Format: Adobe Portable Document Format
Description: Thesis frame

Name: Spikblad Felix Held.pdf
Size: 326.27 KB
Format: Adobe Portable Document Format
Description: Abstract

License bundle

Name: license.txt
Size: 4.68 KB
Format: Item-specific license agreed upon to submission