Statistical Methods for Human Microbiome Data Analysis

Degree type

Doctor of Philosophy (PhD)

Graduate group

Genomics & Computational Biology

Discipline

Subject

High-dimensional statistics
Metagenomics
Microbiome
Variable selection
Bioinformatics
Biostatistics
Microbiology

Copyright date

2014-08-19

Abstract

The human microbiome is the totality of the microbes, their genetic elements, and their interactions with the surrounding environments throughout the human body. Studies have implicated the human microbiome in health and disease. Two central themes of human microbiome studies are identifying factors that influence microbiome composition and defining the relationship between microbiome features and biological or clinical outcomes. With the development of next-generation sequencing technologies, human microbiome composition can be interrogated using high-throughput DNA sequencing. One strategy sequences the bacterial 16S ribosomal RNA gene for species identification. These 16S sequences are usually clustered into Operational Taxonomic Units (OTUs). Analysis of such OTU data raises several important statistical challenges, including accounting for the phylogenetic relationships among OTUs and modeling high-dimensional, overdispersed count data. This dissertation presents three statistical methods developed specifically for 16S data analysis, centered on these two themes. To test the association between overall microbiome composition and a covariate or an outcome, a testing procedure based on a generalized UniFrac distance was developed. The generalized UniFrac distance corrects the undue weight that the classic UniFrac distances place on either highly abundant or rare lineages, and was shown to be more powerful than the classic UniFrac distances. Under the framework of canonical correlation analysis (CCA), a structure-constrained sparse CCA was proposed to select OTUs and their correlated covariates. A phylogenetic structure-constrained penalty function was imposed to induce a degree of smoothness on the linear coefficients according to the phylogenetic relationships among OTUs. Structure-constrained sparse CCA performed much better than sparse CCA in selecting relevant OTUs. Finally, a sparse Dirichlet-multinomial regression (SDMR) model was developed to link microbiome composition to environmental covariates and to select the most important covariates and the OTUs they affect. SDMR accounts for the overdispersion of OTU counts and uses a sparse group L1 penalty function to select covariates and OTUs simultaneously. These methods were illustrated using simulations as well as a real human gut microbiome data set from a study of dietary effects on gut microbiome composition.
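
The weighting idea behind the generalized UniFrac distance can be illustrated with a minimal sketch. The abstract does not state the formula. The code below assumes the commonly cited form, in which each branch of the phylogenetic tree contributes its length times (pA + pB)^alpha times the relative difference |pA - pB| / (pA + pB), normalized by the sum of the weights. The function name, the parameter name `alpha`, and the toy tree are hypothetical and chosen only for illustration. Under this assumed form, alpha = 1 weights branches by abundance (as in weighted UniFrac), alpha = 0 ignores abundance in the weights, and intermediate values sit between the two.

```python
def generalized_unifrac(branch_lengths, prop_a, prop_b, alpha=0.5):
    """Sketch of a generalized UniFrac distance between two samples.

    branch_lengths[i] : length of branch i in the phylogenetic tree
    prop_a[i], prop_b[i] : proportion of each sample's reads that descend
                           from branch i (assumed precomputed from the tree)
    alpha : assumed tuning parameter in [0, 1] controlling how strongly
            abundant lineages are weighted
    """
    numerator = 0.0
    denominator = 0.0
    for b, pa, pb in zip(branch_lengths, prop_a, prop_b):
        total = pa + pb
        if total == 0:
            # Branch absent from both samples: contributes nothing.
            continue
        weight = b * total ** alpha
        numerator += weight * abs(pa - pb) / total
        denominator += weight
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical toy tree with four branches: leaves L1 and L2 under an
# internal node N, and leaf L3 attached directly to the root.
branches = [0.2, 0.3, 0.5, 1.0]      # L1, L2, N, L3
sample_a = [0.5, 0.5, 1.0, 0.0]      # reads only in L1 and L2
sample_b = [0.0, 0.5, 0.5, 0.5]      # reads only in L2 and L3

for a in (0.0, 0.5, 1.0):
    print(a, generalized_unifrac(branches, sample_a, sample_b, alpha=a))
```

The sketch takes per-branch proportions as input rather than a tree object, which keeps it self-contained. A real analysis would derive those proportions from an OTU table and a rooted phylogeny.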

Date of degree

2012-01-01
