Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019

Supplemental Digital Content is available in the text.


Introduction
The PhysioNet/Computing in Cardiology Challenge is an international competition for open-source solutions to complex physiological signal processing and medical classification problems [1]. In 2019, the Challenge's 20th year, we asked participants to develop automated techniques for the early detection of sepsis from clinical data [2].
Sepsis is a life-threatening condition that occurs when the body's response to infection causes tissue damage, organ failure, or death [3][4][5]. Nearly 1.7 million people develop sepsis and 270,000 die from sepsis each year in the U.S., and an estimated 30 million people develop sepsis and 6 million people die from sepsis each year globally [6]. In the U.S., managing sepsis is more expensive than managing any other health condition, with expenses exceeding $24 billion annually, or 13% of U.S. healthcare costs. Altogether, preventing and treating sepsis is a major public health issue with considerable morbidity, mortality, and healthcare costs [7][8][9][10].
Early detection and intervention are critical for improving outcomes of septic patients; each hour of delayed treatment is associated with 4-8% higher mortality [11,12]. However, despite the introduction of new clinical criteria for recognizing sepsis [3][4][5], the fundamental need for early detection and treatment of sepsis remains unmet [13].
Computational approaches promise to improve early sepsis detection. Such approaches typically apply machine learning techniques to clinical data to make real-time predictions up to a day before clinical recognition of sepsis; see [14][15][16] for examples. However, these algorithms frequently address subtly different problems and are developed and tested in different patient cohorts with labels determined using different clinical criteria. We provided a common problem using multiple datasets and the same criteria for sepsis. Moreover, adequately describing such algorithms is a difficult task in a standard journal article. We encouraged participants to release their code in reproducible containers under open-source licenses. Finally, different studies often use varied evaluation metrics, and these metrics typically do not reward algorithms that facilitate early sepsis detection and treatment. We designed a novel metric that addresses this issue and could be generally applicable to infrequent events in sequential prediction tasks. Therefore, while computational approaches demonstrate a potential for early sepsis predictions, the limits of such approaches remain unknown. The PhysioNet/Computing in Cardiology Challenge 2019 provides an opportunity to address these issues.
To discourage teams from designing algorithms with limited applicability, we imposed three key constraints on participants. First, we posted data from two separate hospital systems and sequestered data from a third system so that algorithms that overfit on the shared databases would underperform on the third database. Second, each team's algorithm was scored only once on the third database, preventing sequential training on the hidden data. Third, we evaluated the similarity between algorithms to identify teams that attempted to circumvent the rules by repeated scoring.

Data Source
We sourced data for the Challenge from three geographically distinct U.S. hospital systems with three different electronic medical record (EMR) systems. These data were collected over the past decade with approval from the appropriate Institutional Review Boards. We deidentified and posted the data and labels for 40,336 patients from two of the three hospital systems as public training sets and sequestered the data and labels for 22,761 patients from three hospital systems as hidden test sets.
The Challenge data consist of 40 clinical variables, including 8 vital sign summaries, 26 laboratory values, and 6 patient descriptions; [2] provides a detailed discussion of the data.

Expert Labeling
We labeled the data using the Sepsis-3 clinical criteria [3][4][5]. For each septic patient, we identified three time points:
• t_suspicion: Clinical suspicion of infection is the earlier of intravenous (IV) antibiotics and blood cultures within a specified duration. If IV antibiotics were given first, then the cultures must have been obtained within 24 hours. If cultures were obtained first, then IV antibiotics must have been ordered within 72 hours. IV antibiotics must have been administered for at least 72 consecutive hours.
• t_SOFA: Occurrence of organ failure as identified by a two-point increase in the Sequential Organ Failure Assessment (SOFA) score within a 24-hour period.
• t_sepsis: Onset of sepsis is the earlier of t_suspicion and t_SOFA as long as t_SOFA occurs no more than 24 hours before or 12 hours after t_suspicion.
Septic patients have t_sepsis < ∞, and non-septic patients have t_sepsis = ∞.
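The labeling rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the labeling code used for the Challenge: the function names and the convention of measuring times in hours from admission are assumptions, the ordering and administration of antibiotics are collapsed into a single timestamp, and the requirement of 72 consecutive hours of IV antibiotics is assumed to have been checked beforehand.

```python
import math
from typing import Optional

INF = math.inf  # non-septic patients have t_sepsis = infinity


def suspicion_time(t_antibiotics: Optional[float],
                   t_culture: Optional[float]) -> float:
    """Return t_suspicion in hours, or infinity if the criteria are not met.

    Hypothetical inputs: time of first IV antibiotics and time of first
    blood culture; None means the event never occurred.
    """
    if t_antibiotics is None or t_culture is None:
        return INF
    if t_antibiotics <= t_culture:
        # Antibiotics first: cultures must follow within 24 hours.
        criteria_met = (t_culture - t_antibiotics) <= 24
    else:
        # Cultures first: antibiotics must follow within 72 hours.
        criteria_met = (t_antibiotics - t_culture) <= 72
    return min(t_antibiotics, t_culture) if criteria_met else INF


def sepsis_onset(t_suspicion: float, t_sofa: float) -> float:
    """Return t_sepsis, the earlier of t_suspicion and t_SOFA, or infinity."""
    if math.isinf(t_suspicion) or math.isinf(t_sofa):
        return INF
    # t_SOFA may occur at most 24 h before or 12 h after t_suspicion.
    if t_suspicion - 24 <= t_sofa <= t_suspicion + 12:
        return min(t_suspicion, t_sofa)
    return INF
```

For example, with t_suspicion = 48 and t_SOFA = 30, organ failure precedes suspicion by 18 hours, so the sketch returns t_sepsis = 30; with t_SOFA = 20, the 24-hour limit is exceeded and the patient is labeled non-septic.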

Challenge Objective
The goal of this Challenge is the design of algorithms for early predictions of sepsis using routinely available clinical data. We asked participants to design and implement working, open-source algorithms that can, based only on the provided clinical data, automatically identify a patient's risk of sepsis and make a positive or negative prediction of sepsis for every hourly time window in the patient's clinical record. In particular, we asked participants to predict sepsis at least 6 hours but no more than 12 hours before the onset of sepsis according to Sepsis-3 clinical criteria. To evaluate each algorithm, we designed a clinical utility-based scoring metric that prioritizes algorithms that make actionable predictions. The winner of the Challenge is the team whose algorithm gives predictions with the highest utility score on a hidden test set containing patient records from three hospital systems.
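The hourly prediction task described above can be illustrated with a short Python sketch. It assumes that each patient record is a list of hourly rows of clinical variables and that the prediction for hour t may use only rows up to and including hour t. The names `predict_record` and `toy_risk_model`, the 0.5 threshold, and the use of the first column as heart rate are hypothetical choices for illustration and are not part of the Challenge interface.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]  # one hourly row of clinical variables


def predict_record(rows: Sequence[Row],
                   risk_model: Callable[[Sequence[Row]], float],
                   threshold: float = 0.5) -> List[Tuple[float, int]]:
    """Return a (risk, binary prediction) pair for every hourly time window.

    The model sees only the rows observed so far, never future hours.
    """
    predictions = []
    for t in range(len(rows)):
        history = rows[: t + 1]          # data available at hour t
        risk = risk_model(history)       # estimated risk of sepsis
        predictions.append((risk, int(risk >= threshold)))
    return predictions


def toy_risk_model(history: Sequence[Row]) -> float:
    """Placeholder model: risk rises with the latest value in column 0."""
    latest = history[-1][0]
    return min(1.0, max(0.0, (latest - 60.0) / 80.0))


if __name__ == "__main__":
    record = [[70.0], [100.0], [130.0]]  # three hours, one variable each
    for hour, (risk, label) in enumerate(predict_record(record, toy_risk_model)):
        print(hour, round(risk, 3), label)
```

Passing the growing history rather than the full record is the design point: it keeps the sketch from using information that would not yet be available at the bedside.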

Challenge Scoring
We evaluated each algorithm's predictions using a novel metric that we created for this Challenge. To better capture the clinical utility of sepsis detection and treatment, this metric rewards algorithms for early sepsis predictions in septic patients, and it penalizes algorithms for late or missed sepsis predictions in septic patients and for sepsis predictions in non-septic patients.
Each algorithm makes a binary sepsis prediction for each hourly time window of each patient's record. We define a score for each prediction and aggregate these scores over all hourly time windows in all patient records to provide a score for the algorithm on a dataset. Let x(s, t) = 1 indicate a positive sepsis prediction for patient s at time t and x(s, t) = 0 otherwise, and let δ(s) = 1 if patient s is eventually septic and δ(s) = 0 otherwise. We define a utility score

$$U(s,t) = \begin{cases} U_{TP}(s,t), & x(s,t) = 1 \text{ and } \delta(s) = 1, \\ U_{FP}(s,t), & x(s,t) = 1 \text{ and } \delta(s) = 0, \\ U_{FN}(s,t), & x(s,t) = 0 \text{ and } \delta(s) = 1, \\ U_{TN}(s,t), & x(s,t) = 0 \text{ and } \delta(s) = 0, \end{cases} \tag{1}$$

where U_TP(s, t), U_FP(s, t), U_FN(s, t), and U_TN(s, t) are illustrated in Fig. 1 for an example septic patient with sepsis onset t_sepsis = 48 and an example non-septic patient; the times in the plot are given as examples.
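The per-prediction utility can be sketched as a piecewise-linear function of the time relative to sepsis onset. The text here does not state the numerical values of the utility curves, so every default parameter below (the 12-hour early limit, 6-hour optimum, 3-hour late limit, the maximum reward of 1, the maximum penalty of -2, and the false-positive cost of -0.05) is an assumption chosen for illustration; Fig. 1 remains the reference for the actual shapes. The function names are likewise hypothetical.

```python
from typing import Optional, Sequence


def prediction_utility(predicted: bool, septic: bool, t: float,
                       t_sepsis: Optional[float] = None,
                       dt_early: float = -12.0, dt_optimal: float = -6.0,
                       dt_late: float = 3.0, max_u_tp: float = 1.0,
                       min_u_fn: float = -2.0, u_fp: float = -0.05,
                       u_tn: float = 0.0) -> float:
    """Utility U(s, t) of one hourly prediction (all defaults are assumed)."""
    if not septic:
        # Non-septic patient: small cost for a false alarm, nothing otherwise.
        return u_fp if predicted else u_tn
    dt = t - t_sepsis  # hours relative to sepsis onset
    if dt > dt_late:
        return 0.0  # too late for the prediction to matter
    if predicted:
        if dt <= dt_optimal:
            # Reward ramps up towards the optimal time; predictions that are
            # far too early are treated like false alarms.
            rising = max_u_tp * (dt - dt_early) / (dt_optimal - dt_early)
            return max(rising, u_fp)
        # Reward decays to zero between the optimal and late times.
        return max_u_tp * (dt_late - dt) / (dt_late - dt_optimal)
    if dt <= dt_optimal:
        return 0.0  # no penalty yet for staying silent
    # Penalty for a missed prediction grows until the late time.
    return min_u_fn * (dt - dt_optimal) / (dt_late - dt_optimal)


def record_utility(predictions: Sequence[int], septic: bool,
                   t_sepsis: Optional[float] = None) -> float:
    """Sum U(s, t) over the hourly windows t = 0, 1, ... of one record."""
    return sum(prediction_utility(bool(x), septic, t, t_sepsis)
               for t, x in enumerate(predictions))
```

Under these assumed parameters and t_sepsis = 48, a positive prediction at hour 42 earns the full reward of 1, whereas a negative prediction at hour 51 incurs the full penalty of -2.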
We compute a total utility score for each algorithm by summing the per-prediction utility U(s, t) over all patients s and hourly time windows t,

$$U_{\text{total}} = \sum_{s} \sum_{t} U(s,t), \tag{2}$$

and normalize it to define a normalized utility score,

$$U_{\text{normalized}} = \frac{U_{\text{total}} - U_{\text{no predictions}}}{U_{\text{optimal}} - U_{\text{no predictions}}}, \tag{3}$$

so that an optimal algorithm achieving the highest possible score receives a normalized score of 1 and a completely inactive algorithm that makes only negative predictions receives a normalized score of 0.
Each team was allowed five scored submissions during an unofficial phase of the Challenge from 8 February 2019 to 19 April 2019 and ten scored submissions during an official phase from 25 April 2019 to 25 August 2019. Each algorithm received a normalized utility score (3) on the test set from hospital system A. The algorithm with the highest normalized utility score on the full test set from all three hospital systems wins.
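The normalization described above is a linear rescaling between two reference scores, which the following Python sketch makes explicit. The function name and the example values are hypothetical; the sketch assumes that the total utilities of the algorithm, of an optimal predictor, and of an inactive predictor have already been computed on the same dataset.

```python
def normalized_utility(u_total: float, u_optimal: float,
                       u_inactive: float) -> float:
    """Rescale a total utility so that optimal -> 1 and inactive -> 0."""
    if u_optimal == u_inactive:
        raise ValueError("optimal and inactive utilities must differ")
    return (u_total - u_inactive) / (u_optimal - u_inactive)


if __name__ == "__main__":
    # Illustrative values only: the optimal predictor scores 10 and the
    # inactive predictor scores -20 on some dataset.
    print(normalized_utility(10.0, 10.0, -20.0))   # optimal algorithm
    print(normalized_utility(-20.0, 10.0, -20.0))  # inactive algorithm
    print(normalized_utility(-5.0, 10.0, -20.0))   # halfway between them
```

Because the rescaling is linear, an algorithm that does worse than making no predictions at all, for instance by raising many false alarms, receives a negative normalized score.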

Challenge Scoring Mechanism
Participants were required to submit their sepsis prediction algorithms through a cloud-based submission system. This approach encouraged reproducibility and gave participants the ability to validate their algorithms on real-world data without releasing the sequestered test data.
The submission system used containers that were orchestrated, as pipelines, on Google Cloud. Participants packaged their entries in Docker containers, and the submission system created a pipeline with the entry and our scoring function that it launched on Google Cloud.
Each entry was run in a virtual machine with 2 CPUs and 12 GB of RAM, and each entry was allowed 24 hours of run time on each hidden test set.

Results
A total of 104 teams from academia and industry submitted 853 entries during the official phase of the Challenge, including 430 successful entries from 88 teams. Each successful entry received its score on the test data for hospital system A, and each team nominated its favorite successful entry for evaluation on the test data for hospital systems B and C. Table 1 summarizes the teams with the highest-scoring entries. Unsurprisingly, algorithms generally performed worse on test data from hospital system C than on test data from hospital systems A and B.

Discussion
The PhysioNet/Computing in Cardiology Challenge 2019 asked participants to develop automated, open-source algorithms for the early detection of sepsis from clinical data. We assembled 63,097 patient records from three hospital systems. By posting data from two hospitals publicly and sequestering data from all three hospitals, we provided participants the opportunity to create training methodologies that do not overfit to one medical center. The third hidden database provided a strong indication of how well participants had accomplished this critical task.
We proposed and used a novel evaluation metric that captures the clinical utility of early sepsis detection, weighted by the relative "earliness" or "lateness" of the prediction. This metric could be considered for wider adoption in clinical care because it does not suffer from many of the problems of current metrics that either assume a one-shot decision (accuracy, F-measure, etc.) or no decision threshold (area under the curve metrics).
These efforts provide a more complete picture of how algorithms can provide early sepsis predictions.

Table 1. Clinical utility scores for the teams with the five highest scores on the full test set from hospital systems A, B, and C (Final Score) as well as their scores on the separate test sets from hospital systems A, B, and C (Score A, Score B, and Score C, respectively). * denotes the highest-scoring unofficial entry.

… National Institutes of Health-sponsored Research Resource for Complex Physiologic Signals (www.physionet.org) (R01GM104987). The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors declare no conflict of interest.