Statistical Methods For Phenotyping With Positive-Only Electronic Health Record Data


Degree type

Doctor of Philosophy (PhD)

Graduate group

Epidemiology & Biostatistics

Subject

Anchor
Calibration
Electronic Health Records
Phenotyping
Positive only
Prediction Accuracy
Biostatistics

Copyright date

2022-09-09T20:20:00-07:00

Abstract

Electronic health record (EHR)-based phenotyping requires fully labeled cases and controls for model training and testing. Because clinical workflows are asymmetric, labeled cases can be identified much more easily than labeled controls. As a result, data consisting of a group of labeled cases and a large number of unlabeled patients, referred to as “positive-only” data, are frequently available with minimal labeling effort. This dissertation focuses on statistical methods for training and validating phenotyping models using such positive-only EHR data when the labeled cases can be regarded as a representative subset of all cases. In Project I, we developed an anchor-variable framework and proposed an accompanying maximum likelihood approach for training a logistic phenotyping model. In Project II, we developed a chi-squared test that assesses model calibration by comparing the model-free and model-based estimates of the number of cases among the unlabeled patients. We also proposed consistent estimators of predictive performance measures and studied their large-sample properties. These methods provide the methodological foundation for the routine use of positive-only data in training and validating phenotyping models. In Project III, we extended the maximum likelihood estimation (MLE) method of Project I to accommodate high-dimensional predictors by enabling automated feature selection through a proxy phenotype that is available for all patients. We performed extensive simulation studies to assess the performance of the proposed methods and applied them to Penn Medicine EHR data to phenotype primary aldosteronism.
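
To illustrate the kind of likelihood that positive-only data can support, the following is a minimal sketch and not necessarily the dissertation's exact formulation. It assumes that each true case is labeled with a constant probability c (so that labeled cases are a representative subset of all cases), that no control is ever labeled, and that case status follows a logistic model in the predictors. Under those assumptions the probability of being labeled is c times the logistic case probability. The names `positive_only_loglik`, `beta`, `c`, `X`, and `s` are hypothetical and chosen only for this illustration.

```python
import math


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def positive_only_loglik(beta, c, X, s):
    """Log-likelihood of labeling indicators under the assumed model.

    beta : list of logistic coefficients (assumed model for case status)
    c    : assumed constant probability that a true case is labeled, 0 < c <= 1
    X    : list of predictor rows (include a leading 1.0 for an intercept)
    s    : list of 0/1 flags, 1 = labeled case, 0 = unlabeled patient
    """
    total = 0.0
    for x_i, s_i in zip(X, s):
        # P(case | x) under the logistic phenotyping model
        p_case = expit(sum(b * v for b, v in zip(beta, x_i)))
        # P(labeled | x) = c * P(case | x), because only cases can be labeled
        p_labeled = c * p_case
        total += math.log(p_labeled) if s_i == 1 else math.log(1.0 - p_labeled)
    return total
```

Maximizing this function jointly over `beta` and `c` with a general-purpose optimizer would give a maximum likelihood fit under the stated assumptions; the optimizer and any constraints on `c` are left open here.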

Date of degree

2020-01-01
