Analyzing Big EHR Data: Optimal Cox Regression Subsampling Procedure With Rare Events
May 2023
in “Journal of the American Statistical Association”
TLDR A Cox regression subsampling method makes analyzing large survival datasets with rare events faster and less memory-intensive.
This study addresses the computational challenges posed by massive survival datasets, motivated by the UK Biobank colorectal cancer data, which include genetic and environmental risk factors. The authors propose a Cox regression subsampling procedure for right-censored and possibly left-truncated data with rare events, where observed failure times make up only a small portion of the sample. The method keeps all observed failures and assigns optimal sampling probabilities to the censored observations, so that the subsample-based estimator approximates the full-data partial-likelihood-based estimator while reducing computation time and memory requirements. The methodology's asymptotic properties are established, and simulation studies demonstrate its finite-sample performance. The research is supported by the Israel Science Foundation and the Tel-Aviv University Center for AI and Data Science.
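
The subsampling step described above (keep every observed failure, draw only some of the censored observations) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function `subsample_rare_events`, its parameters, the simulated data, sampling with replacement, and the inverse-probability weights. The optimal sampling probabilities derived in the paper are not reproduced; uniform probabilities stand in for them.

```python
import numpy as np


def subsample_rare_events(event, q, rng, probs=None):
    """Return (indices, weights) for a subsample that keeps all failures.

    event : boolean array, True where a failure was observed
    q     : number of censored observations to draw (with replacement)
    rng   : numpy.random.Generator
    probs : sampling probabilities over the censored rows; uniform if None
            (a placeholder, NOT the paper's optimal probabilities)
    """
    event = np.asarray(event, dtype=bool)
    failures = np.flatnonzero(event)
    censored = np.flatnonzero(~event)
    if probs is None:
        probs = np.full(censored.size, 1.0 / censored.size)
    # Draw positions within the censored set according to probs.
    drawn = rng.choice(censored.size, size=q, replace=True, p=probs)
    idx = np.concatenate([failures, censored[drawn]])
    # Assumed weighting: failures get weight 1, each sampled censored
    # row gets the inverse-probability weight 1 / (q * p_i).
    weights = np.concatenate(
        [np.ones(failures.size), 1.0 / (q * probs[drawn])]
    )
    return idx, weights


# Simulated example: roughly 1% of 10,000 observations are failures.
rng = np.random.default_rng(0)
event = rng.random(10_000) < 0.01
idx, weights = subsample_rare_events(event, q=500, rng=rng)
```

The returned indices and weights would then be passed to any weighted Cox partial-likelihood fitter. With uniform probabilities the weights sum to the full sample size, which is a quick sanity check that the subsample stands in for the full data.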