TRANSFER LEARNING IN CLASSIFICATION AND REGRESSION WITH SUMMARY STATISTICS

Files

Zheng_upenngdas_0175C_15808.pdf (1.06 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Epidemiology and Biostatistics

Discipline

Biology
Statistics and Probability

Subject

Classification
GWAS
High-dimensional models
Linear discriminant analysis
Summary statistics
Transfer learning

Copyright date

2023

Abstract

Transfer learning is one of the most active research areas in statistical learning. In this dissertation, we develop transfer learning methods for high-dimensional settings. In Chapter 2, we develop a transfer learning method for linear discriminant analysis (Trans-LDA) that uses information from auxiliary data sets to build a better classification rule for the target study. The method allows for both homogeneous and heterogeneous covariance matrices across studies. In addition, we introduce an adaptive method, combined with model aggregation, that identifies which auxiliary data sets are likely to be informative for transfer. We show that, under some assumptions, Trans-LDA attains a smaller error in estimating the discriminant direction and a smaller classification error. We illustrate the proposed methods by building a classifier for cardiovascular risk in different groups of patients with chronic kidney disease based on blood proteomics data, and we show that leveraging data sets from different patient subgroups improves classification. In Chapter 3, we consider estimation and prediction in a high-dimensional linear regression model in the transfer learning setting, where we observe only summary statistics from the auxiliary studies, together with external data for estimating linkage disequilibrium (LD). We develop a method for estimating the regression coefficients and the polygenic risk score (PRS) in the target model based on data from the target study, summary statistics from the auxiliary studies, and external data for estimating the LD matrix. We show that using the summary statistics of auxiliary studies in transfer learning improves estimation of the model parameters and the PRS. However, the convergence rate for estimation is slower than that of transfer learning methods with individual-level data. We show that such transfer learning methods lead to better predictions of lipid phenotypes using data from the Penn Medicine Biobank and GWAS summary statistics from the UK Biobank.
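
For readers unfamiliar with the discriminant direction mentioned above, the following is a minimal sketch of standard two-class LDA, together with a naive baseline that simply pools target and auxiliary samples. It is not the Trans-LDA estimator of the dissertation, which is not specified in this abstract. The function names, the ridge term, and the pooling strategy are all assumptions made for illustration.

```python
import numpy as np


def lda_direction(x0, x1, ridge=1e-3):
    """Estimate the LDA discriminant direction from two class samples.

    x0, x1: arrays of shape (n0, p) and (n1, p), one row per observation.
    ridge:  small diagonal term (an assumption of this sketch) that keeps
            the covariance estimate invertible when p is large.
    Returns (beta, midpoint), where beta = Sigma^{-1} (mu1 - mu0).
    """
    mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
    # Pooled within-class covariance, assuming a common covariance matrix.
    centered = np.vstack([x0 - mu0, x1 - mu1])
    sigma = centered.T @ centered / (len(centered) - 2)
    sigma += ridge * np.eye(sigma.shape[0])
    beta = np.linalg.solve(sigma, mu1 - mu0)
    return beta, (mu0 + mu1) / 2


def classify(x, beta, midpoint):
    """Assign class 1 when the LDA score is positive, class 0 otherwise."""
    return ((x - midpoint) @ beta > 0).astype(int)


def pooled_lda_direction(target, auxiliaries, ridge=1e-3):
    """Naive baseline: stack target and auxiliary samples class by class.

    target:      tuple (x0, x1) from the target study.
    auxiliaries: list of (x0, x1) tuples from auxiliary studies.
    This is only sensible when the auxiliary studies share the target's
    class means and covariance; it is NOT the adaptive method described
    in the abstract.
    """
    x0 = np.vstack([target[0]] + [a[0] for a in auxiliaries])
    x1 = np.vstack([target[1]] + [a[1] for a in auxiliaries])
    return lda_direction(x0, x1, ridge=ridge)
```

A call such as `beta, mid = pooled_lda_direction((x0, x1), [(a0, a1)])` followed by `classify(x_new, beta, mid)` yields 0/1 labels. When auxiliary studies differ from the target, naive pooling can hurt, which is the situation the adaptive selection and aggregation step described in the abstract is meant to address.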

Date of degree

2023
