Presentations

Anticipated

"Overcoming data challenges to measure whole-person health in electronic health records," Invited poster @ Anesthesia Research Day, Wake Forest University School of Medicine, Winston-Salem, NC, April 2025. [Poster]

Abstract: Data from Electronic health records (EHR) present a huge opportunity to operationalize a standardized whole-person health score in the learning health system and identify at-risk patients on a large scale, except they are prone to missingness and errors. Ignoring these data quality issues could lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients’ data can be validated and most protocols do not recover missing data. Using a representative sample of 1000 patients from the EHR at an extensive learning health system (100 of whom could be validated), we propose methods to design, conduct, and analyze statistically efficient and robust studies of the ALI and healthcare utilization. Targeted validation with an enriched protocol allowed us to ensure the quality and promote the completeness of the EHR. Findings from our validation study were incorporated into statistical models, which indicated that worse whole-person health was associated with higher odds of engaging in the healthcare system, adjusting for age.

“Statistically and computationally efficient conditional mean imputation for censored covariates," Invited talk @ Lifetime Data Science (LiDS) Conference, Brooklyn, NY, May 2025. [Slides]

Abstract: New scientific questions sometimes introduce new statistical challenges, like censored covariates rather than outcomes. Many promising strategies to tackle censored covariates are shared with the missing data literature, including imputation. However, censored covariates are accompanied by partial information about their actual values, so they are often imputed with conditional means to incorporate this partial information. Estimating these conditional means requires (i) estimating the conditional survival function of the censored covariate and (ii) integrating over that estimated survival function. Most often, the survival function is estimated semiparametrically (e.g., with a Cox model) before the integral over it is approximated numerically (e.g., with the trapezoidal rule). While these semiparametric approaches offer robustness, they come at the cost of statistical and computational efficiency. We propose a generalized framework for parametric imputation of censored covariates that offers better statistical precision and requires less computational strain by estimating the survival function parametrically, where conditional means often have an analytic solution.

“Overcoming data challenges to measure whole-person health in electronic health records," Invited talk @ WNAR Annual Meeting, Whistler, BC, June 2025. [Slides]

Abstract: Data from Electronic health records (EHR) present a huge opportunity to operationalize a standardized whole-person health score in the learning health system and identify at-risk patients on a large scale, except they are prone to missingness and errors. Ignoring these data quality issues could lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients’ data can be validated and most protocols do not recover missing data. Using a representative sample of 1000 patients from the EHR at an extensive learning health system (100 of whom could be validated), we propose methods to design, conduct, and analyze statistically efficient and robust studies of the ALI and healthcare utilization. Targeted validation with an enriched protocol allowed us to ensure the quality and promote the completeness of the EHR. Findings from our validation study were incorporated into statistical models, which indicated that worse whole-person health was associated with higher odds of engaging in the healthcare system, adjusting for age.

“Targeted partial validation to make EHR data au-dit they can be: Correcting for data quality issues in the learning health system," Invited talk @ Joint Statistical Meetings, Nashville, TN, August 2025. [Slides]

Abstract: The allostatic load index (ALI) is an informative summary of whole-person health that is predictive of downstream health outcomes. The ALI uses biomarker data to measure cumulative stress on five systems in the body for the general adult population. Borrowing data from electronic health records (EHR) is a promising opportunity to estimate the ALI and potentially identify at-risk patients on a large scale. However, routinely collected EHR data may contain missingness and errors, and ignoring these data quality issues could lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically only a subset of patients’ data can be validated. Thus, we devise a targeted study design (“targeted audit”) to harness the error-prone surrogates from the EHR to identify the most informative patient records for validation. Specifically, the targeted audit design seeks the best statistical precision to quantify the association between ALI and healthcare utilization in logistic regression. In this talk, we detail the process of the targeted audit design and its application to EHR data from Atrium Health Wake Forest Baptist.

“Imputing inequality: Efficient methods for estimating food access in low-access communities," Invited talk @ Royal Statistical Society International Conference, Edinburgh, Scotland, September 2025. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people. This disparity in food access may lead to disparities in well-being, potentially with disproportionate rates of diseases in communities that face more challenges in accessing healthy food. Identifying low-access, high-risk communities for targeted interventions is a public health priority. Current methods to quantify food access rely on distance measures that are either computationally simple (like the shortest straight-line route) or accurate (like the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, harnessing the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the "gold standard" model using map-based distances for all neighborhoods and improved efficiency over the "complete case" model using map-based distances for just the subset. Through a measurement error framework, straight-line distances are leveraged to impute for neighborhoods without map-based distances. Using simulations and data for North Carolina, U.S.A., we quantify the associations of diabetes and obesity with neighborhood-level proximity to healthy foods. Imputation also makes it possible to predict an area's full food access landscape from incomplete data.

2025

“Predicting access, preventing disparities: Imputation solutions for quantifying food access," Contributed talk @ ENAR Spring Meeting, New Orleans, LA, March 2025. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measures that are either computationally simple (like the length of the shortest straight-line route) or accurate (like the length of the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, offering improved computational ease and accuracy. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to Using map-based distances for all neighborhoods and improved efficiency over Using map-based distances for just the subset. Information from the straight-line distances can be leveraged to impute for any neighborhoods without map-based distances. Using simulations and data for the Piedmont Triad region of North Carolina, we quantify and compare the associations between diabetes and obesity and neighborhood-level proximity to healthy foods.

“Predicting access, preventing disparities: Statistical solutions for quantifying food access," Invited talk @ Wake Forest University, Translational Science Center Research Retreat on Aging, Winston-Salem, NC, March 2025. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measures that are either computationally simple (like the length of the shortest straight-line route) or accurate (like the length of the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, offering improved computational ease and accuracy. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to Using map-based distances for all neighborhoods and improved efficiency over Using map-based distances for just the subset. Information from the straight-line distances can be leveraged to impute for any neighborhoods without map-based distances. Using simulations and data for the Piedmont Triad region of North Carolina, we quantify and compare the associations between diabetes and obesity and neighborhood-level proximity to healthy foods.

“Rerouting our thinking: Computationally simple measures misrepresent the food access landscape," Invited talk @ the University of Pennsylvania, Philadelphia, PA, February 2025. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people. This disparity in food access may lead to disparities in well-being, potentially with disproportionate rates of diseases in communities that face more challenges in accessing healthy food. Identifying low-access, high-risk communities for targeted interventions is a public health priority. Current methods to quantify food access rely on distance measures that are either computationally simple (like the shortest straight-line route) or accurate (like the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, harnessing the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the "gold standard" model using map-based distances for all neighborhoods and improved efficiency over the "complete case" model using map-based distances for just the subset. Through a measurement error framework, straight-line distances are leveraged to impute for neighborhoods without map-based distances. Using simulations and data for North Carolina, U.S.A., we quantify the associations of diabetes and obesity with neighborhood-level proximity to healthy foods. Imputation also makes it possible to predict an area's full food access landscape from incomplete data.

“Electronic health records: With great-ish data comes great responsibility," Invited talk @ Auburn University, Auburn, AL (Virtual), January 2025. [Slides]

Abstract: The allostatic load index (ALI) is a composite measure of whole-person health, which reflects the multiple interrelated physiological regulatory systems that underlie healthy functioning. Data from electronic health records (EHR) present a huge opportunity to operationalize the ALI in the learning health system and identify at-risk patients on a large scale, except they are prone to missingness and errors. Ignoring these data quality issues could lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients’ data can be validated and most protocols do not recover missing data. Using a representative sample of 1000 patients from the EHR at an extensive learning health system (100 of whom could be validated), we propose methods to design, conduct, and analyze statistically efficient and robust studies of the ALI and healthcare utilization. With measurement error methods, we incorporate all available patient information (validated or unvalidated) into statistical models. Using optimal design theory, we examine ways to strategically select the most informative patients for validation. Incorporating clinical expertise, we devise a novel validation protocol to promote the quality and completeness of EHR data.

“Combining EHR data and measurement error methods to quantify the relationships between access to healthy foods and disease,” Invited talk @ International Conference on Health Policy Statistics, San Diego, CA, January 2025. [Slides]

Abstract: Disparities in food access relate to disparities in well-being, leading to disproportionate rates of diseases in communities that face more challenges in accessing healthy food. Identifying low-access, high-risk communities for targeted interventions is a public health priority, but there are limitations with the currently available methods and data that we are working to resolve. Previous studies exploring this topic have used county- or census tract-level data on disease rates and food access, capturing a broad range of diverse communities. However, counties and census tracts cannot provide specific details about the individuals and communities within them. Additionally, these area-level disease rates are often small area estimates (SAEs) with their own uncertainty that should be but is often not, included in the analysis. In this project, we investigate the relationship between the distance from patients' homes to the nearest healthy food store (proximity) and the prevalence of diabetes. How proximity to healthy foods is measured poses an additional challenge, as distance measures are either computationally simple and inaccurate (straight-line distances) or computationally complex and accurate (map-based distances). To approach these questions, we extract patient information (including diabetes diagnoses) from the electronic health record (EHR), geocode patients' home addresses, and calculate straight-line and map-based proximity to healthy foods. Using various health disparities methods, including rate ratios, relative indices of inequality, and concentration curves, we quantify whether patients with farther proximities to healthy foods (indicating worse access) face a higher burden of prevalent diabetes. Finally, we discuss the impact of using inaccurate access measures to quantify health disparities.

2024

“Connecting healthy food proximity and disease: Straight-line vs. map-based distances,” Invited talk @ CMStatistics, London, England, December 2024. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people. This disparity in food access may lead to disparities in well-being, potentially with disproportionate rates of diseases in communities that face more challenges in accessing healthy food. Identifying low-access, high-risk communities for targeted interventions is a public health priority. Current methods to quantify food access rely on distance measures that are either computationally simple (like the shortest straight-line route) or accurate (like the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, harnessing the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the "gold standard" model using map-based distances for all neighborhoods and improved efficiency over the "complete case" model using map-based distances for just the subset. Through a measurement error framework, straight-line distances are leveraged to impute for neighborhoods without map-based distances. Using simulations and data for North Carolina, U.S.A., we quantify the associations of diabetes and obesity with neighborhood-level proximity to healthy foods. Imputation also makes it possible to predict an area's full food access landscape from incomplete data.

"Imputing inequality: Efficient methods for estimating food access in low-access communities," Invited talk @ International Biometric Conference, Atlanta, GA, December 2024. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. This disparity in food access may lead to disparities in well-being, potentially with disproportionate rates of diseases in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measures that are either computationally simple (like the length of the shortest straight-line route) or accurate (like the length of the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the "gold standard'' model using map-based distances for all neighborhoods and improved efficiency over the "complete case'' model using map-based distances for just the subset. Through the adoption of a measurement error framework, information from the straight-line distances can be leveraged to compute informative placeholders (i.e., impute) for any neighborhoods without map-based distances. Using simulations and data for the Piedmont Triad region of North Carolina, we quantify and compare the associations between various health outcomes (diabetes and obesity) and neighborhood-level proximity to healthy foods. The imputation procedure also makes it possible to predict the full landscape of food access in an area without requiring map-based measurements for all neighborhoods.

"Imputing inequality: Efficient methods for estimating food access in low-access communities," Invited talk @ Vanderbilt University, Nashville, TN, November 2024. [Slides]

Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. This disparity in food access may lead to disparities in well-being, potentially with disproportionate rates of diseases in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measures that are either computationally simple (like the length of the shortest straight-line route) or accurate (like the length of the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the "gold standard'' model using map-based distances for all neighborhoods and improved efficiency over the "complete case'' model using map-based distances for just the subset. Through the adoption of a measurement error framework, information from the straight-line distances can be leveraged to compute informative placeholders (i.e., impute) for any neighborhoods without map-based distances. Using simulations and data for the Piedmont Triad region of North Carolina, we quantify and compare the associations between various health outcomes (diabetes and obesity) and neighborhood-level proximity to healthy foods. The imputation procedure also makes it possible to predict the full landscape of food access in an area without requiring map-based measurements for all neighborhoods.

“Overcoming data challenges to estimate whole person health in the academic learning health system,” Invited talk (seminar, co-presented with Dr. Joseph Rigdon) @ Center for Artificial Intelligence Research, Wake Forest University School of Medicine, Winston-Salem, NC (Virtual), October 2024.

Abstract: The allostatic load index (ALI) is an informative summary of whole-person health that is predictive of downstream health outcomes. The ALI uses biomarker data to measure cumulative stress on five systems in the body for the general adult population. Borrowing data from electronic health records (EHR) is a promising opportunity to estimate the ALI and potentially identify at-risk patients on a large scale. However, routinely collected EHR data may contain missingness and errors, and ignoring these data quality issues could lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically only a subset of patients’ data can be validated. Thus, we devise a targeted study design (“targeted audit”) to harness the error-prone surrogates from the EHR to identify the most informative patient records for validation. Specifically, the targeted audit design seeks the best statistical precision to quantify the association between ALI and healthcare utilization in logistic regression. In this talk, we detail the process of the targeted audit design and its application to EHR data from Atrium Health Wake Forest Baptist.

“Missing and misclassified wells: Challenges in quantifying the HIV reservoir from dilution assays,” Invited talk @ Women in Statistics and Data Science Conference, Reston, VA, October 2024. [Slides]

Abstract: People living with HIV on antiretroviral therapy often have undetectable virus levels by standard assays, but latent" HIV still persists in viral reservoirs. Eliminating these reservoirs is the goal of HIV cure research. Dilution assays, including the quantitative viral outgrowth assay (QVOA) and more detailed Ultra Deep Sequencing Assay of the outgrowth virus (UDSA), are commonly used to estimate the reservoir size, i.e., the infectious units per million (IUPM) of HIV-persistent resting CD4+ T cells. This paper considers efficient statistical inference about the IUPM from combined dilution assay (QVOA) and deep viral sequencing (UDSA) data, even when some deep sequencing data are missing. Moreover, existing inference methods for the IUPM assumed that the assays are "perfect" (i.e., they have 100% sensitivity and specificity), which can be unrealistic in practice. The proposed methods accommodate assays with imperfect sensitivity and specificity, wells sequenced at multiple dilution levels, and include a novel bias-corrected estimator for small samples. The proposed methods are evaluated in a simulation study, applied to data from the University of North Carolina HIV Cure Center, and implemented in the open-source R package SLDeepAssay.

“Operationalizing a whole-hospital, whole-person health score in the EHR despite data quality issues,” Invited talk @ International Day of Women in Statistics and Data Science, Virtual, October 2024. [Slides]

Abstract: The allostatic load index (ALI) is an informative summary of whole-person health, drawing upon biomarkers to measure lifetime strain. Borrowing data from electronic health records (EHR) is a natural way to estimate whole-person health and identify at-risk patients on a large scale. However, these routinely collected data contain missingness and errors, and ignoring these data quality issues can lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients' data can be validated. Thus, we consider strategic ways to harness the error-prone ALI from the EHR to target the most informative patient records for validation. Specifically, the validation study is designed to achieve the best statistical precision to quantify the association between ALI and healthcare utilization in a logistic regression model. Further, we propose a semiparametric maximum likelihood estimator for this model, which robustly corrects data quality issues in unvalidated records while preserving the power of the full cohort. Through simulations and an application to the EHR of an extensive academic learning health system, targeted partial validation and the semiparametric estimator are shown to be effective and efficient ways to correct data quality issues in EHR data before using them in research.

“Missing and misclassified wells: Challenges in quantifying the HIV reservoir from dilution assays,” Invited talk (department seminar) @ National Institute of Allergy and Infectious Diseases, Biostatistics Research Branch, Rockville, MD, October 2024. [Slides]

Abstract: People living with HIV on antiretroviral therapy often have undetectable virus levels by standard assays, but latent" HIV still persists in viral reservoirs. Eliminating these reservoirs is the goal of HIV cure research. Dilution assays, including the quantitative viral outgrowth assay (QVOA) and more detailed Ultra Deep Sequencing Assay of the outgrowth virus (UDSA), are commonly used to estimate the reservoir size, i.e., the infectious units per million (IUPM) of HIV-persistent resting CD4+ T cells. This paper considers efficient statistical inference about the IUPM from combined dilution assay (QVOA) and deep viral sequencing (UDSA) data, even when some deep sequencing data are missing. Moreover, existing inference methods for the IUPM assumed that the assays are "perfect" (i.e., they have 100% sensitivity and specificity), which can be unrealistic in practice. The proposed methods accommodate assays with imperfect sensitivity and specificity, wells sequenced at multiple dilution levels, and include a novel bias-corrected estimator for small samples. The proposed methods are evaluated in a simulation study, applied to data from the University of North Carolina HIV Cure Center, and implemented in the open-source R package SLDeepAssay.

“Combining straight-line and map-based distances to quantify neighborhood-level food access and its impact on health,” Invited poster @ Joint Statistical Meetings, Portland, Oregon, August 2024. [Poster]

Abstract: Healthy foods are essential for a healthy life, but not everyone has the same access to healthy foods, leading to disproportionate rates of diseases in low-access communities. Current methods to quantify food access rely on distance measures that are either computationally simple (the shortest straight-line route) or accurate (the shortest map-based route), but not both. We combine these food access measures through a multiple imputation for measurement error framework, leveraging information from less accurate straight-line distances to compute informative placeholders (i.e., impute) more accurate food access for any neighborhoods without map-based distances. Thus, computationally expensive map-based distances are only needed for a subset of neighborhoods. Using simulations and data for Forsyth County, North Carolina, we quantify and compare the associations between the prevalence of various health outcomes and neighborhood-level food access. Through imputation, predicting the full landscape of food access for all neighborhoods in an area is also possible without requiring map-based measurements for all neighborhoods.

“Connecting healthy food proximity and disease: Straight-line vs. map-based distances,” Keynote address @ Big Data Summer Institute, University of Michigan, Ann Arbor, Michigan, July 2024. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people. This disparity in food access may lead to disparities in well-being, potentially with disproportionate rates of diseases in communities that face more challenges in accessing healthy food. Identifying low-access, high-risk communities for targeted interventions is a public health priority. Current methods to quantify food access rely on distance measures that are either computationally simple (like the shortest straight-line route) or accurate (like the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, harnessing the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the "gold standard" model using map-based distances for all neighborhoods and improved efficiency over the "complete case" model using map-based distances for just the subset. Through a measurement error framework, straight-line distances are leveraged to impute for neighborhoods without map-based distances. Using simulations and data for North Carolina, U.S.A., we quantify the associations of diabetes and obesity with neighborhood-level proximity to healthy foods. Imputation also makes it possible to predict an area's full food access landscape from incomplete data.

“Semiparametrically correcting for data quality issues to estimate whole-hospital, whole-body health from the EHR,” Invited talk @ International Symposium on Nonparametric Statistics, Braga, Portugal, June 2024. [Slides]

Abstract: The allostatic load index (ALI) is an informative summary of whole-person health, drawing upon biomarkers to measure lifetime strain. Borrowing data from electronic health records (EHR) is a natural way to estimate whole-person health and identify at-risk patients on a large scale. However, these routinely collected data contain missingness and errors, and ignoring these data quality issues can lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients' data can be validated. Thus, we consider strategic ways to harness the error-prone ALI from the EHR to target the most informative patient records for validation. Specifically, the validation study is designed to achieve the best statistical precision to quantify the association between ALI and healthcare utilization in a logistic regression model. Further, we propose a semiparametric maximum likelihood estimator for this model, which robustly corrects data quality issues in unvalidated records while preserving the power of the full cohort. Through simulations and an application to the EHR of an extensive academic learning health system, targeted partial validation and the semiparametric estimator are shown to be effective and efficient ways to correct data quality issues in EHR data before using them in research.

“Adjusting for misclassification to quantify the relationship between diabetes and local access to healthy foods,” Invited talk @ WNAR/IBS/Graybill Annual Meeting, Fort Collins, CO, June 2024. [Slides]

Abstract: Healthy foods are essential for a healthy life, but not everyone has the same access to healthy foods, leading to disproportionate rates of diseases in low-access communities. Current methods to quantify food access rely on distance measures that are either computationally simple (the shortest straight-line route) or accurate (the shortest map-based route), but not both. Communities can be classified as having high or low access to healthy foods if they have at least one healthy foods store within 1 mile based on either distance measure. However, straight-line distance measures underestimate actual distance, leading some communities to be misclassified as having high access when they do not. We propose a maximum likelihood estimator for Poisson regression with a misclassified exposure. This estimator uses error-prone straight-line food access measures for all communities enriched with error-free map-based ones for a targeted subset, offering reduced computational burden. Using simulations and data for the Piedmont Triad, North Carolina, we quantify and compare the associations between the prevalence of diabetes and community-level food access with and without correcting for misclassification.

“Extrapolation before imputation reduces bias when imputing heavily censored covariates,” Invited talk @ HiTec Meeting & Workshop on Complex Data in Econometrics and Statistics, Limassol, Cyprus (Virtual), March 2024. [Slides]

Abstract: Modeling symptom progression to identify informative subjects for a new Huntington’s disease clinical trial is problematic since time to diagnosis, a key covariate, can be heavily censored. Imputation is an appealing strategy where censored covariates are replaced with their conditional means, but existing methods saw over 200% bias under heavy censoring. Calculating these conditional means well requires estimating and then integrating over the survival function of the censored covariate from the censored value to infinity. To flexibly estimate the survival function, existing methods use the semiparametric Cox model with Breslow’s estimator. Then, for integration, the trapezoidal rule is used, but the trapezoidal rule is not designed for improper integrals and leads to bias. We propose calculating the conditional mean with adaptive quadrature instead, which can handle the improper integral. Yet, even with adaptive quadrature, the integrand (the survival function) is undefined beyond the observed data, so we identify the “Weibull extension” as the best method to extrapolate and then integrate. In simulation studies, we show that replacing the trapezoidal rule with adaptive quadrature and adopting the Weibull extension corrects the bias seen with existing methods. We further show how imputing with corrected conditional means helps to prioritize patients for future clinical trials.

“Targeting the most informative patients for EHR validation using error-prone surrogates,” Invited talk @ ENAR Spring Meeting, Baltimore, MD, March 2024. [Slides]

Abstract: The allostatic load index (ALI) is an informative summary of whole-person health, drawing upon biomarkers to measure lifetime strain. Borrowing data from electronic health records (EHR) is a natural way to estimate whole-person health and identify at-risk patients on a large scale. However, these routinely collected data contain missingness and errors, and ignoring these data quality issues can lead to biased statistical results and incorrect clinical decisions. Validation of EHR data (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients' data can be validated. Thus, we devise an optimal study design to harness the error-prone surrogates from the EHR to target the most informative patient records for validation. Specifically, the optimal design seeks the best statistical precision to quantify the association between ALI and healthcare utilization in Poisson regression. Through simulations and an application to the EHR of an extensive academic learning health system, targeted partial validation is shown to be an effective and efficient way to correct data quality issues in EHR data before using them in research.

“Overcoming computational hurdles to quantify food access and its impact on health: a statistical approach,” Invited talk @ Davidson College, Davidson, NC, February 2024. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. With this disparity in food access comes disparities in well-being, leading to disproportionate rates of diseases in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority. However, current methods to quantify food access rely on distance measures that are either computationally simple (the length of the shortest straight-line route) or accurate (the length of the shortest map-based route), but not both. We propose a hybrid statistical approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the “gold standard” model using map-based distances for all neighborhoods and improved efficiency over the “complete case” model using map-based distances for just the subset. Adopting a measurement error framework allows information from the straight-line distances to be leveraged to compute informative placeholders (i.e., impute) for any neighborhoods without map-based distances. Using simulations and data for Forsyth County, North Carolina, we quantify and compare the associations between various health outcomes (coronary heart disease, diabetes, high blood pressure, and obesity) and neighborhood-level food access. The imputation procedure also makes it possible to predict the full landscape of food access in an area without requiring map-based measurements for all neighborhoods.

“Extrapolation before imputation reduces bias when imputing heavily censored covariates,” Invited talk @ Banff International Research Station, Banff, Canada, January 2024. [Slides]

Abstract: Modeling symptom progression to identify informative subjects for a new Huntington’s disease clinical trial is problematic since time to diagnosis, a key covariate, can be heavily censored. Imputation is an appealing strategy where censored covariates are replaced with their conditional means, but existing methods saw over 200% bias under heavy censoring. Calculating these conditional means well requires estimating and then integrating over the survival function of the censored covariate from the censored value to infinity. To flexibly estimate the survival function, existing methods use the semiparametric Cox model with Breslow’s estimator. Then, for integration, the trapezoidal rule is used, but the trapezoidal rule is not designed for improper integrals and leads to bias. We propose calculating the conditional mean with adaptive quadrature instead, which can handle the improper integral. Yet, even with adaptive quadrature, the integrand (the survival function) is undefined beyond the observed data, so we identify the “Weibull extension” as the best method to extrapolate and then integrate. In simulation studies, we show that replacing the trapezoidal rule with adaptive quadrature and adopting the Weibull extension corrects the bias seen with existing methods. We further show how imputing with corrected conditional means helps to prioritize patients for future clinical trials.

2023

“Challenges in quantifying the HIV reservoir from dilution assays - Overcoming missingness and misclassification,” Invited talk @ CMStatistics, Berlin, Germany, December 2023. [Slides]

Abstract: People living with HIV on antiretroviral therapy often have undetectable virus levels by standard assays, but latent" HIV still persists in viral reservoirs. Eliminating these reservoirs is the goal of HIV cure research. Dilution assays, including the quantitative viral outgrowth assay (QVOA) and more detailed Ultra Deep Sequencing Assay of the outgrowth virus (UDSA), are commonly used to estimate the reservoir size, i.e., the infectious units per million (IUPM) of HIV-persistent resting CD4+ T cells. This paper considers efficient statistical inference about the IUPM from combined dilution assay (QVOA) and deep viral sequencing (UDSA) data, even when some deep sequencing data are missing. Moreover, existing inference methods for the IUPM assumed that the assays are "perfect" (i.e., they have 100% sensitivity and specificity), which can be unrealistic in practice. The proposed methods accommodate assays with imperfect sensitivity and specificity, wells sequenced at multiple dilution levels, and include a novel bias-corrected estimator for small samples. The proposed methods are evaluated in a simulation study, applied to data from the University of North Carolina HIV Cure Center, and implemented in the open-source R package SLDeepAssay.

"Combining straight-line and map-based distances to quantify the impact of neighborhood-level food access on health," Invited talk @ Purdue University, Department of Statistics, West Lafayette, IN, November 2023. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. With this disparity in food access comes disparities in well-being, leading to disproportionate rates of diseases in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measures that are either computationally simple (the length of the shortest straight-line route) or accurate (the length of the shortest map-based route), but not both. We propose a hybrid statistical approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the “gold standard” model using map-based distances for all neighborhoods and improved efficiency over the “complete case” model using map-based distances for just the subset. Through the adoption of a multiple imputation for measurement error framework, information from the straight-line distances can be used to fill in map-based measures for the remaining neighborhoods. Using data for Forsyth County, North Carolina, and its surrounding counties, we quantify and compare the associations between the prevalence of coronary heart disease and different measures of neighborhood-level food access.

"Overcoming computational hurdles to quantify the impact of food access on health: A statistical approach," Invited talk @ Women in Statistics and Data Science Conference, Bellevue, WA, October 2023. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. With this disparity in food access comes disparities in well-being, leading to disproportionate rates of diseases in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority. Still, current methods to quantify food access rely on distance measures that are either computationally simple (the length of the straight-line route) or accurate (the length of the map-based route), but not both. Using data for Forsyth County, North Carolina, we first quantify and compare the associations between various health outcomes (coronary heart disease, diabetes, high blood pressure, and obesity) and neighborhood-level food access based on (i) straight-line distance versus (ii) map-based distance for all neighborhoods. We then propose a hybrid statistical approach to combine these methods, allowing researchers to harness the former’s computational ease with the latter’s accuracy. Specifically, we adopt a measurement error framework, incorporating food access based on straight-line distance for all neighborhoods (an “error-prone” covariate) and based on map-based distance for just a subset (an “error-free” covariate). By imputing map-based food access for the remaining neighborhoods, the hybrid model offers comparable estimates to the “gold standard” model with food access based on map-based distance for all neighborhoods and preserves the statistical efficiency of the whole study. We will also briefly discuss how open-source tools like the Google Maps API were used to collect data for this study, including grocery store addresses and geocoded locations.

"Combining straight-line and map-based distances to quantify the impact of neighborhood-level food access on health," Invited talk @ the University of South Carolina, Department of Statistics, Columbia, SC, September 2023. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. With this disparity in food access comes disparities in well-being, leading to disproportionate rates of diseases in communitiesthat face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measuresthat are either computationally simple (the length of the shortest straight-line route) or accurate (the length of the shortest map-based route), but not both. We propose a hybrid statistical approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the “gold standard” model using map-based distances for all neighborhoods and improved efficiency over the “complete case” model using map-based distances for just the subset. Through the adoption of a multiple imputation for measurement error framework, information from the straight-line distances can be used to fill in map-based measures for the remaining neighborhoods. Using data for Forsyth County, North Carolina, and its surrounding counties, we quantify and compare the associations between various health outcomes (coronary heart disease, diabetes, high blood pressure, and obesity) and different measures of neighborhood-level food access.

"Statistical methods to accurately and efficiently quantify access-based disparities in health," Invited talk @ the University of Wisconsin-Madison, Department of Statistics, Madison, WI, September 2023. [Slides]

Abstract: Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. With this disparity in food access comes disparities in well-being, leading to disproportionate rates of diseases in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance t are either computationally simple (the length of the shortest straight-line route) or accurate (the length of the shortest map-based route), but not both. We propose a hybrid statistical approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the “gold standard” model using map-based distances for all neighborhoods and improved efficiency over the “complete case” model using map-based distances for just the subset. Through the adoption of a multiple imputation for measurement error framework, information from the straight-line distances can be used to fill in map-based measures for the remaining neighborhoods. Using data for Forsyth County, NorthCarolina, and its surrounding counties, we quantify and compare the associations between health outcomes (coronary heart disease, diabetes, high blood pressure, and obesity)and different measures of neighborhood-level food access.

“Optimal multi-wave validation of secondary use data with outcome and exposure misclassification,” Contributed talk (topic-contributed session) @ Joint Statistical Meetings, Toronto, Canada, August 2023. [Slides]

Abstract: The growing availability of observational databases like electronic health records (EHR) provides unprecedented opportunities for secondary use of such data in biomedical research. However, these data can be error-prone and must be validated before use. It is usually unrealistic to validate the whole database due to resource constraints. A cost-effective alternative is to implement a two-phase design that validates a subset of patient records that are enriched for information about the research question of interest. Herein, we consider odds ratio estimation under differential outcome and exposure misclassification. We propose optimal designs that minimize the variance of the maximum likelihood odds ratio estimator. We develop a novel adaptive grid search algorithm that can locate the optimal design in a computationally feasible and numerically accurate manner. Because the optimal design requires specification of unknown parameters at the outset and thus is unattainable without prior information, we introduce a multi-wave sampling strategy to approximate it. We demonstrate the proposed designs' efficiency gains through extensive simulations and two large observational studies.

“Efficient estimation of the HIV reservoir with partial deep sequencing data,” Invited talk @ International Chinese Statistics Association Applied Statistics Symposium, Ann Arbor, MI, June 2023. [Slides]

Abstract: People living with HIV on antiretroviral therapy often have undetectable virus levels by standard assays, but “latent" HIV still persists in viral reservoirs. Eliminating these reservoirs is the goal of HIV cure research. The quantitative viral outgrowth assay (QVOA) is commonly used to estimate the reservoir size, i.e., the infectious units per million (IUPM) of HIV-persistent resting CD4+ T cells. A new variation of the QVOA, the Ultra Deep Sequencing Assay of the outgrowth virus (UDSA), was recently developed that further quantifies the number of viral lineages within a subset of infected wells. Performing the UDSA on a subset of wells provides additional information that can improve IUPM estimation. This paper considers statistical inference about the IUPM from combined dilution assay (QVOA) and deep viral sequencing (UDSA) data, even when some deep sequencing data are missing. The proposed methods accommodate assays with wells sequenced at multiple dilution levels and include a novel bias-corrected estimator for small samples. The proposed methods are evaluated in a simulation study, applied to data from the University of North Carolina HIV Cure Center, and implemented in the open-source R package SLDeepAssay.

“It's integral: Replacing the trapezoidal rule to remove bias and correctly impute censored covariates with their conditional mean,” Invited talk @ ENAR Spring Meeting, Nashville, TN, March 2023. [Slides]

Abstract: Modeling symptom progression to prioritize subjects for a new Huntington's disease clinical trial is problematic since key covariate time to diagnosis can be censored. Imputation is an appealing strategy where censored covariates are replaced with their conditional means, but existing methods saw over 100% bias. Calculating these conditional means well requires estimating and integrating over the survival function of the censored covariate from the censored value to infinity. To flexibly estimate the survival function, existing methods use the Cox model with Breslow's estimator. Then, for integration, the trapezoidal rule, which is not designed for indefinite integrals, is used. This leads to bias. We propose a calculation that handles the indefinite integral with adaptive quadrature. Yet, even with adaptive quadrature, the integrand (the survival function) is undefined beyond the observed data. We identify the best method to extrapolate. In simulations, we show that replacing the trapezoidal rule with adaptive quadrature (plus extrapolation) corrects the bias. We further show how imputing with corrected conditional means helps prioritize patients for clinical trials.

2022

“Overcoming censored predictors with imputation to model the progression of Huntington's disease,” Invited talk @ CMStatistics, London, UK (Virtual), December 2022. [Slides]

Abstract: Clinical trials to test experimental treatments for Huntington's disease are expensive, so it is prudent to enroll subjects whose symptoms may be most impacted by the treatment during follow-up. However, modeling how symptoms progress to identify such subjects is problematic since time to diagnosis, a key predictor, can be censored. Imputation is an appealing strategy where censored predictors are replaced with their conditional means, the calculation of which requires estimating and integrating over its conditional survival function from the censored value to infinity. However, despite efforts to make conditional mean imputation as flexible as possible, it still makes restrictive assumptions about the censored predictor (such as proportional hazards) that may not hold in practice. We develop a suite of extensions to conditional mean imputation to encourage its applicability to a wide range of clinical settings. We adopt new estimators for the conditional survival function to offer more efficient and robust inference and propose an improved conditional mean calculation. We discuss in simulations when each version of conditional mean imputation is most appropriate and evaluate our methods as we model symptom progression from Huntington's disease data. Our imputation suite is implemented in the open-source R package, imputeCensRd.

“One size fits all: A generalized algorithm to fitting GLMs with censored predictors In R,” Contributed talk @ Women In Statistics and Data Science Conference, St. Louis, MO, October 2022. [Slides]

Abstract: Ever since the discovery of therapies that target the genetic root cause of Huntington disease, researchers have worked to test if these therapies can slow or halt the disease symptoms. A first step towards achieving this is modeling how symptoms progress to know when the best time is to initiate a therapy. Because symptoms are most detectable before and after a clinical diagnosis, modeling how symptoms progress has been problematic since the time to clinical diagnosis is often censored (i.e., for patients who have not yet been diagnosed). This creates a pressing statistical challenge for modeling how symptoms (the outcome) change before and after time to clinical diagnosis (a censored predictor). Strategies to tackle this challenge include fitting a generalized linear model with a censored covariate using maximum likelihood estimation. Still, implementation of these models can be taxing because each new setting (i.e., different outcome models and distributions for the censored predictor) requires a new algorithm to be derived. To this end, we have created the glmCensRd package, which includes generalized linear model fitting functions for a multitude of outcome and (censored) predictor specifications and various random and non-random censoring types. The glmCensRd package makes fiKng generalized linear models in R as accessible with censored predictors as without.

"Getting thrifty with data quality: Efficient two-phase designs for error-prone data," Invited talk @ Wake Forest University School of Medicine, Department of Biostatistics and Data Science, Winston-Salem, NC, September 2022. [Slides]

Abstract: Clinically meaningful variables are increasingly becoming available in observational databases like electronic health records (EHR). However, these data can be error-prone, giving misleading results in statistical inference. Data auditing can help maintain data quality but is often unrealistic for entire databases (especially large ones like EHR). A cost-effective solution is the two-phase design: error-prone variables are observed for all patients during Phase I and that information is used to select patients for auditing during Phase II. However, even these partial audits can be expensive. To this end, we propose methods to promote the statistical efficiency of two-phase designs, ensuring the integrity of observational cohort data while maximizing our investment. First, given the resource constraints imposed upon data audits, targeting the most informative patients is paramount for efficient statistical inference. Using the asymptotic variance of the maximum likelihood estimator, we compute the most efficient design under complex outcome and exposure misclassification. Since the optimal design depends on unknown parameters, we propose a multi-wave design to approximate it in practice. We demonstrate the superior efficiency of the optimal designs through extensive simulations and illustrate their implementation in observational HIV studies. Then, to obtain efficient odds ratios with partially audited, error-prone data, we propose a semiparametric analysis approach that uses all information and accommodates many error mechanisms. The outcome and covariates can be error-prone, with correlated errors, and selection of Phase II records can depend on Phase I data in an arbitrary manner. We devise an EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate advantages of the proposed methods through extensive simulations and provide applications to a multi-national HIV cohort.

“New frontiers for conditional mean imputation: Overcoming censored predictors to model the progression of Huntington's disease,” Invited talk @ Joint Statistical Meetings, Washington DC, August 2022. [Slides]

Abstract: Since discovering therapies that target the genetic root of Huntington's disease, researchers have investigated whether these therapies can slow or halt symptoms and, if so when is best to initiate treatment. Symptoms are most detectable before and after clinical diagnosis, but modeling their progression is problematic since the time to clinical diagnosis is often censored. This creates a pressing statistical challenge: modeling how symptoms (the outcome) change across time to clinical diagnosis (a censored predictor). Conditional mean imputation is an appealing strategy, replacing censored times of clinical diagnoses. However, despite efforts to make conditional mean imputation flexible, it still makes restrictive assumptions (like proportional hazards) that may be unrealistic. We develop a suite of extensions to conditional mean imputation by incorporating estimators for the conditional distribution of the censored predictor to offer more efficient and robust inference. We discuss in simulations when each version of conditional mean imputation is most appropriate and evaluate our methods in Huntington's disease data.

“New frontiers for conditional mean imputation: Modeling the progression of Huntington's disease despite censored covariates,” Invited poster @ 22nd Meeting of New Researchers in Statistics and Probability, Fairfax, VA, August 2022. [Poster]

Abstract: Since discovering therapies that target the genetic root cause of Huntington's disease, researchers have been investigating whether these therapies can slow or halt the disease symptoms. To identify the best time to initiate treatment, a first step is modeling how symptoms progress in prospective cohorts. Symptoms are most detectable before and after a clinical diagnosis, but modeling how symptoms progress has been problematic since the time to clinical diagnosis is often censored (i.e., for patients thus far undiagnosed). In contrast to traditional survival analysis, this creates a pressing statistical challenge as we model how symptoms (the outcome) change relative to clinical diagnosis (a censored covariate). Conditional mean imputation is an appealing strategy since it is an intuitive way to “fill in the gaps” in our data by estimating all censored times to clinical diagnoses. Depending on how we calculate the conditional means, imputation can offer both efficiency and robustness in statistical inference. However, despite efforts to make conditional mean imputation flexible, it still makes restrictive assumptions (such as proportional hazards) that may not hold in practice. We set out to develop a suite of extensions that could offer even more robustness and efficiency in a wide range of clinical settings. Along the way, we discovered that the published conditional mean formula itself needed some improvements. Through extensive simulations, we quantify the bias introduced by the published formula and show that our proposed method does a good job of correcting for it. We also apply our methods to Huntington's disease study data to evaluate how imputing with biased conditional means can impact clinical decision-making.

“Design and analysis strategies with 'secondary use' data,” Invited talk @ International Chinese Statistics Association Applied Statistics Symposium, Gainesville, FL, June 2022. [Slides]

Abstract: The growing availability of observational databases like electronic health records (EHR) provides unprecedented opportunities for secondary use of such data in biomedical research. However, these data can be error-prone and need to be validated before use. It is usually unrealistic to validate the whole database due to resource constraints. A cost-effective alternative is to implement a two-phase design that validates a subset of patient records that are enriched for information about the research question of interest. In this talk, I will discuss proper statistical approaches to analyze such two-phase studies, which can efficiently use the information in the unvalidated data in Phase I and address the potential biased validation sample selection in Phase II. I will demonstrate the advantages of the proposed methods over existing ones through extensive simulations and an application to an ongoing HIV observational study.

"Advancing conditional mean Imputation for statistical modeling with censored predictors," Invited talk @ Brigham and Women's Hospital & Harvard Medical School, Department of Pharmacoepidemiology and Pharmacoeconomics, Boston, MA (Virtual), April 2022. [Slides]

Abstract: Since discovering therapies that target the genetic root of Huntington's disease, researchers have investigated whether these therapies can slow or halt symptoms and, if so, when the best time is to initiate treatment. Symptoms are most detectable before and after clinical diagnosis, but modeling their progression is problematic since the time to clinical diagnosis is often censored. This creates a pressing statistical challenge: modeling how symptoms (the outcome) change across time to clinical diagnosis (a censored predictor). Conditional mean imputation is an appealing strategy since it is a simple way to estimate all times of clinical diagnoses that are censored. However, despite efforts to make conditional mean imputation as flexible as possible, it still makes restrictive assumptions about the censored predictor (such as proportional hazards) that may not hold in practice. We develop a suite of extensions to conditional mean imputation to encourage its applicability to a wide range of clinical settings. We incorporate additional estimators for the conditional distribution of the censored predictor to offer more efficient and robust inference and propose an improved conditional mean calculation. Through extensive simulations, we discuss when each version of conditional mean imputation is most appropriate.

"See a need, fill a need: Tackling our research questions using data and statistics," [Slides]

Guest lecture @ Syracuse University for SPM295: Research Methods, Department of Sports Management, Syracuse, NY (Virtual), March 2022.
Guest lecture @ Lipscomb University for SPM295: COUN 5603/PSG 5603: Research Methods and Statistics, Departments of Clinical Mental Health Counseling/Psychology, Nashville, TN (Virtual), March 2022.

“Advancing conditional mean imputation for censored predictors,” Contributed talk @ ENAR Spring Meeting, Houston, TX, March 2022. [Slides]

Abstract: Since discovering therapies that target the genetic root of Huntington's disease, researchers have investigated whether these therapies can slow or halt symptoms and, if so, when the best time is to initiate treatment. Symptoms are most detectable before and after clinical diagnosis, but modeling their progression is problematic since the time to clinical diagnosis is often censored. This creates a pressing statistical challenge: modeling how symptoms (the outcome) change across time to clinical diagnosis (a censored predictor). Conditional mean imputation is an appealing strategy since it is a simple way to estimate all times of clinical diagnoses that are censored. However, despite efforts to make conditional mean imputation as flexible as possible, it still makes restrictive assumptions about the censored predictor (such as proportional hazards) that may not hold in practice. We develop a suite of extensions to conditional mean imputation to encourage its applicability to a wide range of clinical settings. We incorporate additional estimators for the conditional distribution of the censored predictor to offer more efficient and robust inference and propose an improved conditional mean calculation. Through extensive simulations, we discuss when each version of conditional mean imputation is most appropriate.

2021

"Optimal multi-wave validation of secondary use data with outcome and exposure misclassification," Contributed talk @ Women in Statistics and Data Science Conference, Virtual, October 2021. [Slides]

Abstract: The growing availability of observational databases like electronic health records (EHR) provides unprecedented opportunities for secondary use of such data in biomedical research. However, these data can be error-prone and need to be validated before use. It is usually unrealistic to validate the whole database due to resource constraints. A cost-effective alternative is to implement a two-phase design that validates a subset of patient records that are enriched for information about the research question of interest. Herein, we consider odds ratio estimation under differential outcome and exposure misclassification. We propose optimal designs that minimize the variance of the maximum likelihood odds ratio estimator. We develop a novel adaptive grid search algorithm that can locate the optimal design in computationally feasible and accurate manner. Because the optimal design requires specification of unknown parameters at the outset, and thus is unattainable without prior information, we introduce a multi-wave sampling strategy to approximate it in practice. We demonstrate the efficiency gains of the proposed designs over existing ones through extensive simulations and two large observational studies. We provide an R package and Shiny app to facilitate the use of the optimal designs.

"Getting thrifty with data quality: Efficient two-phase designs for error-prone data," Invited talk @ the University of South Carolina, Department of Biostatistics and Epidemiology, Columbia, SC, December 2021. [Slides]

Abstract: Clinically meaningful variables are increasingly becoming available in observational databases like electronic health records (EHR). However, these data can be error-prone, giving misleading results in statistical inference. Data auditing can help maintain data quality but is often unrealistic for entire databases (especially large ones like EHR). A cost-effective solution is the two-phase design: error-prone variables are observed for all patients during Phase I and that information is used to select patients for auditing during Phase II. However, even these partial audits can be expensive. To this end, we propose methods to promote the statistical efficiency of two-phase designs, ensuring the integrity of observational cohort data while maximizing our investment. First, given the resource constraints imposed upon data audits, targeting the most informative patients is paramount for efficient statistical inference. Using the asymptotic variance of the maximum likelihood estimator, we compute the most efficient design under complex outcome and exposure misclassification. Since the optimal design depends on unknown parameters, we propose a multi-wave design to approximate it in practice. We demonstrate the superior efficiency of the optimal designs through extensive simulations and illustrate their implementation in observational HIV studies. Then, to obtain efficient odds ratios with partially audited, error-prone data, we propose a semiparametric analysis approach that uses all information and accommodates many error mechanisms. The outcome and covariates can be error-prone, with correlated errors, and selection of Phase II records can depend on Phase I data in an arbitrary manner. We devise an EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate advantages of the proposed methods through extensive simulations and provide applications to a multi-national HIV cohort.

"Getting thrifty with data quality: Efficient two-phase designs for error-prone data," Invited talk @ Wake Forest University, Department of Mathematics & Statistics, Winston-Salem, NC, November 2021. [Slides]

Abstract: Clinically meaningful variables are increasingly becoming available in observational databases like electronic health records (EHR). However, these data can be error-prone, giving misleading results in statistical inference. Data auditing can help maintain data quality but is often unrealistic for entire databases (especially large ones like EHR). A cost-effective solution is the two-phase design: error-prone variables are observed for all patients during Phase I and that information is used to select patients for auditing during Phase II. However, even these partial audits can be expensive. To this end, we propose methods to promote the statistical efficiency of two-phase designs, ensuring the integrity of observational cohort data while maximizing our investment. First, given the resource constraints imposed upon data audits, targeting the most informative patients is paramount for efficient statistical inference. Using the asymptotic variance of the maximum likelihood estimator, we compute the most efficient design under complex outcome and exposure misclassification. Since the optimal design depends on unknown parameters, we propose a multi-wave design to approximate it in practice. We demonstrate the superior efficiency of the optimal designs through extensive simulations and illustrate their implementation in observational HIV studies. Then, to obtain efficient odds ratios with partially audited, error-prone data, we propose a semiparametric analysis approach that uses all information and accommodates many error mechanisms. The outcome and covariates can be error-prone, with correlated errors, and selection of Phase II records can depend on Phase I data in an arbitrary manner. We devise an EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate advantages of the proposed methods through extensive simulations and provide applications to a multi-national HIV cohort.

"Getting thrifty with data quality: Efficient two-phase designs for error-prone data," Invited talk @ the University of North Carolina at Chapel Hill, Department of Biostatistics, Chapel Hill, NC, November 2021. [Slides]

Abstract: Clinically meaningful variables are increasingly becoming available in observational databases like electronic health records (EHR). However, these data can be error-prone, giving misleading results in statistical inference. Data auditing can help maintain data quality but is often unrealistic for entire databases (especially large ones like EHR). A cost-effective solution is the two-phase design: error-prone variables are observed for all patients during Phase I and that information is used to select patients for auditing during Phase II. However, even these partial audits can be expensive. To this end, we propose methods to promote the statistical efficiency of two-phase designs, ensuring the integrity of observational cohort data while maximizing our investment. First, given the resource constraints imposed upon data audits, targeting the most informative patients is paramount for efficient statistical inference. Using the asymptotic variance of the maximum likelihood estimator, we compute the most efficient design under complex outcome and exposure misclassification. Since the optimal design depends on unknown parameters, we propose a multi-wave design to approximate it in practice. We demonstrate the superior efficiency of the optimal designs through extensive simulations and illustrate their implementation in observational HIV studies. Then, to obtain efficient odds ratios with partially audited, error-prone data, we propose a semiparametric analysis approach that uses all information and accommodates many error mechanisms. The outcome and covariates can be error-prone, with correlated errors, and selection of Phase II records can depend on Phase I data in an arbitrary manner. We devise an EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate advantages of the proposed methods through extensive simulations and provide applications to a multi-national HIV cohort.

"Filling in the blanks: Multiply imputing missing data in R," Tutorial @ R-Ladies Research Triangle Park, Virtual, November 2021. [Slides]

2020

"Efficient odds ratio estimation using partial data audits error-prone, observational HIV cohort data," Contributed poster @ International Chinese Statistics Association Applied Statistics Symposium, Virtual, December 2020. [Slides][Poster]

Abstract: Persons living with HIV engage in clinical care often, so observational HIV research cohorts generate especially large amounts of routine clinical data. Increasingly, these data are being used in biomedical research, but available information can be error-prone, and biased statistical estimates can mislead results. The Caribbean, Central, and South America network for HIV epidemiology is one such cohort; fortunately, data audits have been conducted. Risk of AIDS defining event after initiating antiretroviral therapy is of clinical interest, expected to be associated with CD4 lab value and AIDS status. Error-prone values for 5109 patients were in the research database, and validated data were available (substantiated by clinical source documents) on only 117 patients. Instead of naive (unaudited) or complete case (audited) analysis, we propose a novel semi-parametric likelihood method using all available information (unau- dited and audited) to obtain unbiased, efficient odds ratios with error-prone outcome and covariates. Point estimates were farther from the null than the naive analysis, directionality agreed with the complete case analysis but had narrower confidence intervals.

"geogRaphy: An introduction to spatial data In R," Guest lecture @ Vanderbilt University for BIOS6301: Introduction to Statistical Computing, Department of Biostatistics, Nashville, TN (Virtual), November 2020. [Slides][Code][Video]

"Efficient odds ratio estimation using error-prone data from a multi-national HIV research cohort," Invited talk @ Vanderbilt University, Health Policy Grand Rounds, Nashville, TN (Virtual), October 2020. [Slides][Video]

"Odds ratio estimation in error-prone, observational HIV cohort data," Contributed talk @ Women In Statistics and Data Science Conference, Virtual, October 2020. [Slides]

Abstract: Persons living with HIV engage in clinical care often, so observational HIV research cohorts generate especially large amounts of routine clinical data. Increasingly, these data are being used in biomedical research, but available information can be error prone and biased statistical estimates can mislead results. The Caribbean, Central, and South America network for HIV epidemiology is one such cohort; fortunately, data audits have been conducted. Risk of AIDS defining event after initiating antiretroviral therapy is of clinical interest, expected to be associated with CD4 lab value and AIDS status. Error-prone values for 5109 patients were in the research database, and validated data were available (substantiated by clinical source documents) on only 117 patients. Instead of naive (unaudited) or complete case (audited) analysis, we propose a novel semiparametric likelihood method using all available information (unaudited and audited) to obtain unbiased, efficient odds ratios with error prone outcome and covariates. Point estimates were farther from the null than the naive analysis, directionality agreed with the complete case analysis, but had narrower confidence intervals.

"Da-ta day life," Guest lecture @ Syracuse University for SPM295: Research Methods, Department of Sports Management, Syracuse, NY (Virtual), September 2020. [Slides]

"Introduction to missing data," Guest lecture @ Vanderbilt University for BIOS6312: Modern Regression Analysis, Department of Biostatistics, Nashville, TN (Virtual), April 2020. [Slides]

"Robust estimation with outcome misclassification and covariate measurement error in logistic regression," Contributed talk @ ENAR Spring Meeting, Virtual, March 2020. [Slides]

Abstract: Persons living with HIV engage in routine clinical care, generating large amounts of data in observational cohorts. These data are usually error-prone, and directly using them in biomedical research can yield misleading results. A cost-effective solution is the two-phase design, under which the error-prone variables are observed for all patients during Phase I and that information is used to select patients for data auditing during Phase II. For example, the Caribbean, Central, and South America Network for HIV Epidemiology (CCASAnet) selected random samples from each site for data auditing. We consider efficient odds ratio estimation with partially audited, error-prone data and propose a semiparametric approach that accommodates a number of error mechanisms. We allow both the outcome and covariates to be error-prone and errors to be correlated, and selection of Phase II can depend on Phase I data in an arbitrary manner. We devise an EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate the superiority of the proposed methods over existing ones through extensive simulations and provide applications to CCASAnet.

2019

"Robust odds ratio estimation using error-prone data: Correcting for outcome misclassification and covariate measurement error in logistic regression," Invited talk @ Vanderbilt University, Department of Biostatistics, Nashville, TN, November 2019. [Slides]

Abstract: While electronic health records were first intended to support clinical care, financial billing, and insurance claims, these databases are now used for clinical investigations aimed at preventing disease, improving patient care, and informing policymaking. However, both responses and predictors of interest can be captured with errors and their discrepancies correlated. Odds ratios and their standard errors estimated via logistic regression using error-prone data will be biased. A cost-effective solution to a complete data audit is a two-phase design. During Phase I error-prone variables are observed for all subjects, and this information then used to select a Phase II validation subsample. Previous approaches to outcome misclassification using two-phase design data are limited to error-prone categorical predictors and make distributional assumptions about the errors. We propose a semiparametric approach to two-phase designs with a misclassified, binary outcome and categorical or continuous error-prone predictors, allowing for dependent errors and arbitrary second-phase selection. The proposed method is robust because it yields consistent estimates without making assumptions about the predictors’ error mechanisms. An EM algorithm was devised to maximize the likelihood function. The resulting estimators possess desired statistical properties. Performance is compared to existing approaches through extensive simulation studies and illustrated in an observational HIV study.

"Semiparametric maximum likelihood for logistic regression with response and covariate measurement error," Contributed talk @ Joint Statistical Meetings, Denver, CO, August 2019. [Slides]

Abstract: Modern electronic health records systems routinely collect data on variables of clinical interest. In epidemiology, logistic regression can be used to analyze the association between a binary response (e.g. disease status) and predictors of interest (e.g. age or exposure status). However, in observational databases responses and predictors can be differentially misclassified, and their misclassification can be dependent. A cost-effective solution to a complete data audit is the two-phase design. During the first phase, error-prone outcome and covariates are observed for all subjects and this information is then used to select a validation subsample for accurate measurements of these variables in the second phase. Previous measurement error corrections focused on error-prone covariates only or rely on a validation subsample that is simple or stratified. Herein, we propose a semiparametric approach to general two-phase measurement error problems with a binary outcome, allowing for correlated errors in the outcome and covariates and arbitrary second-phase selection. We devise a computationally efficient and numerically stable EM algorithm to maximize the nonparametric likelihood function. The resulting estimators possess desired statistical properties. We compare the proposed methods to existing approaches through extensive simulation studies, and we illustrate their use in an observational HIV study.

"Efficient Inference for two-phase designs with response and covariate measurement error," Contributed talk @ ENAR Spring Meeting, Philadelphia, PA, March 2019. [Slides]

Abstract: In modern electronic health records systems, both the outcome and covariates of interest can be error-prone and their errors often correlated. A cost-effective solution is the two-phase design, under which the error-prone outcome and covariates are observed for all subjects during the first phase and that information is used to select a validation subsample for accurate measurements of these variables in the second phase. Previous research on two-phase measurement error problems largely focused on scenarios where there are errors in covariates only or the validation sample is a simple random sample. Herein, we propose a semiparametric approach to general two-phase measurement error problems with a continuous outcome, allowing for correlated errors in the outcome and covariates and arbitrary second-phase selection. We devise a computationally efficient and numerically stable EM algorithm to maximize the nonparametric likelihood function. The resulting estimators possess desired statistical properties. We compare the proposed methods to existing approaches through extensive simulation studies, and we illustrate their use in an observational HIV study.

"A comparison of self-audits as alternatives to travel-audits in improving data quality In the Caribbean, Central and South America network for HIV epidemiology," Poster @ Vanderbilt Institute of Global Health: Global Health Symposium, Nashville, TN, February 2019. [Poster]

2018

"A comparison of self-audits as alternatives to travel-audits in improving data quality In the Caribbean, Central and South America network for HIV epidemiology," Poster @ International Workshop on HIV and Hepatitis Observational Databases, Fuengirola, Spain, March 2018. [Slides][Poster]

"Impact of distance calculation methods on geospatial analysis of healthcare access," Contributed poster (speed session) @ Women in Statistics and Data Science Conference, Cincinnati, OH, October 2018. [Poster]

Abstract: The CDC publishes the CHSI dataset, providing county-level key health indicators for the United States. The current analysis focuses on the density of primary care physicians per 100,000 people as a gauge of healthcare access in a subset of 878 counties in Tennessee. Access to primary care may be measured by travel distance or travel time. We investigate the implications of travel time and two distance measures: travel distance and travel time using the Google Maps API and shortest distance ("as the crow flies") using the Haversine formula. These measures were used to create empirically estimated (classical) semivariograms. We utilized ordinary kriging methods (OK) to interpolate the density of primary care physicians in the areas between the geographic centroids of the counties (three analyses total). We will compare kriged results between the three methods, illustrate results with maps, and explore the relationships between travel distance/time and Haversine distance.

"Impact of distance calculation methods on geospatial analysis of healthcare access," Contributed poster (speed session) @ Joint Statistical Meetings, Vancouver, BC, August 2018. [Slides][Poster]

Abstract: The CDC publishes the CHSI dataset, providing county-level key health indicators for the United States. The current analysis focuses on the density of primary care physicians per 100,000 people as a gauge of healthcare access in a subset of 878 counties in Tennessee. Access to primary care may be measured by travel distance or travel time. We investigate the implications of travel time and two distance measures: travel distance and travel time using the Google Maps API and shortest distance ("as the crow flies") using the Haversine formula. These measures were used to create empirically estimated (classical) semivariograms. We utilized ordinary kriging methods (OK) to interpolate the density of primary care physicians in the areas between the geographic centroids of the counties (three analyses total). We will compare kriged results between the three methods, illustrate results with maps, and explore the relationships between travel distance/time and Haversine distance.

2016

"Comparing admission and enrollment outcomes through a spatial scope," Poster @ University of Florida Undergraduate Research Symposium, Gainesville, FL, March 2016. [Poster]

Abstract: The university application, admission, and enrollment process involves an intricate sequence of decision-making. Initial consideration depends entirely on a potential student's choice to turn interest into action and submit an application. Subsequent assessments must be made by admissions officers of the applicant’s qualifications and likelihood of success at their institution. Assuming a positive admissions outcome, the final verdict lies in the hands of the individual student: where will they choose to matriculate? The natural means to attaining the best possible class of incoming freshmen is through targeting not only the most qualified but also the most likely to enroll. This paper investigated methods to predict an individual’s probability of gaining admission to and later enrolling at a large, land-grant research university through the use of logit general linear models with additional consideration for location-based predictors, as well as comparative data analysis on characteristics of admitted students from varied distances (based on their high school) through ANOVA, paired t-tests, and odds-ratios. Results of these tests pinpointed applicants coming from 100 – 200 miles from the university with the highest odds of admission and admits living less than 100 miles away with the greatest odds of enrollment. ANOVA and subsequent t-tests indicated that average applicant standards varied between applicants from different distances to the university, which could be explained by institutional efforts in geographic diversity.