Survival Analyses of COVID-19 Patients in a Turkish Cohort: Comparison between Using Time to Death and Time to Release

Survival analysis of COVID-19 data has its own unique features, in particular the existence of two distinct events, death and release from the hospital, within a very short period of time. This multiple-event situation belongs to a type where the occurrence of the first event prevents the second event from happening, and vice versa. We carried out two cause-specific univariate Cox regression survival analyses, one for time-to-death and another for time-to-release. Each survival analysis is further split into one for onset-of-symptom to event time and another for hospitalization to event time. We also carried out a case-control (death vs. release) analysis without considering the time-to-event information. We observed that risk factors can be detected by either case-control or survival analysis, even though the goals of the two are quite different. We also observed that the two survival analyses may not both identify a factor as a risk factor; sometimes only one of them does. We prefer these two rounds of Cox regression over the mixture cure model, which focuses only on time-to-death events, whose number is usually limited. Utilizing time-to-release events may greatly increase the sample size available for revealing risk factors for COVID-19.


Introduction
COVID-19 is a new infectious disease causing a global pandemic on a scale not seen in one hundred years, since the 1918 flu pandemic. Much has been learned about the disease in the past year: it is caused by the coronavirus SARS-CoV-2 (previously called 2019-nCoV), a single-stranded RNA virus with a 30 kb genome and around a dozen genes [1]; the disease is unusually infectious, perhaps made even more so by mutations such as an aspartic acid (D) to glycine (G) mutation at amino acid position 614 of the spike protein [2,3], an asparagine (N) to tyrosine (Y) mutation at position 501 in the B.1.1.7 variant ("UK variant") [4], and a glutamic acid (E) to lysine (K) mutation at position 484 in several variants including B.1.351 ("South Africa variant") [5], P.1 ("Brazil/Japan variant") [6], and B.1.526 ("New York variant") [7]; and though COVID-19 is a respiratory disease primarily targeting the lungs, the virus invades epithelial cells in blood vessels and other organs, so it is far from a single-organ disease.
Epidemiological studies on COVID-19 are extensive. Older age is a top risk factor for death from COVID-19. Comorbidities such as obesity and diabetes are also dominant risk factors [8][9][10]. Most of the studies are static in the sense that the association between a factor and an outcome is tested regardless of any time factor. A severe COVID-19 patient may recover quickly after treatment or die after a long battle in the ICU despite all efforts by the doctors. The time the patient spent in hospital before either recovery or death is an extra piece of information that reveals how a factor conveys risk.
Time-to-event outcomes are studied in survival analysis. In regression survival analysis (e.g. the Cox proportional-hazards model), the dependent variable is the time from a baseline point to the event, plus a status value that handles right-censored samples, where status=0 if it is unknown whether the event occurs or not. The best-known example of survival analysis in medical research is cancer survival: the event is cancer-caused death, and the time is the time lapse from the day of (e.g.) diagnosis to the day of death. If a patient is alive or quits the study, the time is the time lapse from the day of diagnosis to the day at the end of the study (or the day when the patient quits), and his/her status is zero.
In order to apply survival analysis to COVID-19 patient data, we have to deal with new situations. First, unlike cancer, where there is only one kind of event, i.e., death (being alive is considered as having an unknown status towards the eventual death), there are two kinds of events for COVID-19: death and release from the hospital. Second, cancer survival time is often measured in years, whereas COVID-19 patients stay in hospital for a much shorter time. It is quite likely that there are no right-censored samples in the dataset, and that all patients were either released/cured or died. How do we handle two times to event: time to death and time to release?
In the study by Allison (2014) [11], five situations are discussed for the case where there are multiple kinds of events (I, IIa, IIb, IIc, IId). Let us assume there are only two kinds of events, and the dependent variable is the pair (time, status), where status=1 (first kind of event), 2 (second kind of event), or 0 (right censored). Situation-I: one mechanism determines the time value, whereas another mechanism determines whether status is 1 or 2. Situation-IIa: if a person has a status=1 event, the person is no longer at risk of the status=2 event (and vice versa). Situation-IIb: if a person has a status=1 event, the person is no longer available for observation of a status=2 event (and vice versa). Situation-IIc: the risks for the two kinds of events are independent. Situation-IId: if a person has a status=1 event, the risk for the status=2 event is changed (and vice versa).
Obviously, if a COVID-19 patient dies, he/she will not be released as a healthy person, and if a patient is released, we consider his/her COVID-19 cured and assume he/she will no longer die from COVID-19. These observations put COVID-19 time-to-event data under the situation-IIb group. The recommended procedure is to carry out two rounds of survival analysis, with the first round treating event-1 as status=1 and event-2 as status=0, and the second round switching the two [11].
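The two-round recoding described above can be sketched as follows. This is a minimal Python illustration and not the code used in this study (the analyses here were run in R); the outcome labels, the toy data, and the helper name `cause_specific_status` are hypothetical.

```python
# Minimal sketch of the two-round cause-specific recoding (hypothetical labels).
# Each patient has exactly one observed outcome: "death" or "release".

def cause_specific_status(outcomes, event_of_interest):
    """Return status=1 for the event of interest and status=0 (treated as
    censored) for the competing event."""
    return [1 if outcome == event_of_interest else 0 for outcome in outcomes]

# Toy outcomes for five patients (invented for illustration only).
outcomes = ["release", "death", "release", "release", "death"]

# Round 1: death is the event, release is treated as censoring.
status_death = cause_specific_status(outcomes, "death")
# Round 2: release is the event, death is treated as censoring.
status_release = cause_specific_status(outcomes, "release")

print(status_death)    # [0, 1, 0, 0, 1]
print(status_release)  # [1, 0, 1, 1, 0]
```

When no patient is right-censored, the two status vectors are complementary; each round then pairs the same time values with its own status vector.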
A portion of the field of competing-risk survival analysis is dominated by the study of correct estimation of the survival curve under multiple events, where some event (e.g. a surgical operation) contributes to the survival/nonsurvival status of a patient. Without considering the contribution from the new type of event, the Kaplan-Meier estimate of the survival curve would be biased. In particular, there is the subdistribution (sd) method of Fine & Gray (1999) [12]. In the Fine-Gray approach, the estimated hazard ratio (sdHR) is guaranteed to be in the same direction as the relative risk (i.e., sdHR > 1 ↔ relative risk > 1; see [13] for a refresher on the definition of both). Some authors support the use of the Fine-Gray method for COVID-19 data [14], and others actually use it [15].
However, not only can the interpretation of sdHR be confusing (e.g., P Allison (2018), "For causal analysis of competing risks, don't use Fine & Gray's subdistribution method", http://statisticalhorizons.com/for-causal-analysisof-competing-risks), but the release and death events also do not precede one another; they are mutually exclusive, a situation not compatible with the one for which the Fine-Gray method applies. Many papers have explored the relationship between sdHR and the cause-specific hazard ratio (csHR), which results from the practice described above for Allison's situation IIb, e.g. [16][17][18][19]. The conclusions are as follows: even if the proportional hazards assumption is violated, it is still acceptable to calculate csHR, and csHR correctly measures the effect of a factor on the hazard for a subject who has not yet experienced either of the two events. If the goal of a survival analysis is to establish an association with a factor and not to estimate the survival curve, csHR is a better choice than sdHR.
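For reference, the two hazards compared above can be written using the standard competing-risks definitions. The notation is introduced here for illustration only and does not appear in the original text: T is the time to the first event, D ∈ {1, 2} is the type of that event, x is a covariate, and β_k is its Cox coefficient for event type k.

```latex
% Cause-specific hazard for event type k:
% the risk set contains only subjects who have had no event before time t.
\lambda^{cs}_k(t) = \lim_{\Delta t \to 0}
  \frac{P\bigl(t \le T < t + \Delta t,\; D = k \mid T \ge t\bigr)}{\Delta t}

% Subdistribution hazard (Fine-Gray) for event type k:
% subjects who already had the competing event remain in the risk set.
\lambda^{sd}_k(t) = \lim_{\Delta t \to 0}
  \frac{P\bigl(t \le T < t + \Delta t,\; D = k \mid
        T \ge t \ \text{or}\ (T < t,\; D \ne k)\bigr)}{\Delta t}

% Cause-specific Cox model and the resulting hazard ratio.
\lambda^{cs}_k(t \mid x) = \lambda^{cs}_{k,0}(t)\, e^{\beta_k x},
\qquad \mathrm{csHR}_k = e^{\beta_k}
```

The only difference between the two definitions is the conditioning event, i.e., who counts as still being at risk at time t.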
We carry out this two-event survival analysis (i.e., two rounds of survival analysis using either time-to-death or time-to-release) to estimate cause-specific hazard ratios and the corresponding p-values for COVID-19 patients from a Turkish hospital. Intuitively, we may expect the two rounds of survival analysis to detect the same risk factors but in opposite directions: the presence of a risk factor value may shorten the time-to-death, while the absence of the risk value may shorten the time-to-release. Our comparative analysis will show that the real situation is more complicated. We also carry out a "static" analysis, or case-control analysis, by ignoring the time-to-event information. The case samples are the deceased patients and the control samples are the released ones. Intuitively, we may not expect case-control analysis and survival analysis to lead to the same conclusion, as the questions asked are different. However, we find that the results from the two analyses are actually closer to each other than expected.

Data and Methods
Patient data: Patients (n=109) over 18 years of age who were admitted with COVID-19 to the Tokat State Hospital were included. Patients were treated on regular COVID-19 wards, in intermediate care, and in intensive care units (ICU) run by different departments. All patients had a laboratory-confirmed case of COVID-19, defined by a positive result on a reverse-transcriptase polymerase chain reaction (RT-PCR) test of a nasopharyngeal swab. Patients who left the hospital against medical advice and those who were transferred to external hospitals were excluded. We recorded the time from onset of the symptoms prompting presentation to the event, and the number of days from hospitalization to release or death.
Admission data included demographics such as gender, age, and job type. We did not use the job type because of small sample sizes per category. Comorbidities were recorded in the following eight categories: lung disease (COPD or other chronic pulmonary disease), heart disease (coronary artery disease/ischemic cardiomyopathy or heart failure NYHA II-IV), diabetes mellitus, chronic liver disease, active malignancy, primary or secondary immunodeficiency, obesity (body mass index (BMI) > 30 kg/m²), and hypertension (two nurse-administered measurements with a mercury blood pressure monitor, both with systolic/diastolic blood pressure higher than 140/90 mmHg). Each person can have one, multiple, or no comorbidities. We did not use cancer, liver disease, and immunodeficiency because of the small sample size per comorbidity group. Other patient information includes tobacco history and whether the patient is a health care worker.
We have the following common laboratory blood test measures: complete blood count (CBC) with absolute lymphocyte count (including red blood cell count, white blood cell count, platelet count, lymphocyte count, lymphocyte percentage, neutrophil count, neutrophil percentage, monocyte count, eosinophil count), metabolic chemistry panel (BMP/CMP) (including albumin), C-reactive protein (CRP), fibrinogen, D-dimer, and other proteins. Laboratory data reflect the first test result obtained within the first 24 hours of hospitalization. Some patient information has an obvious association with mortality and was not included as factors: pneumonia status, degree of pneumonia, being in the intensive care unit (ICU), and being intubated.

Programs used:
All analyses were carried out in R (www.r-project.org) using packages including survival (functions: coxph, survfit) and the standard statistical functions (fisher.test, wilcox.test, glm(..., family="binomial")). The survival analysis with competing risks using the Fine-Gray subdistribution model [12] was carried out with the cmprsk R package.

Results
General description of the time-to-events and the analysis tasks: For each patient, we have the time to event (either death or release) from two different starting points: the time of onset of symptoms and the time of admission to hospital. The cumulative distribution of the time-to-event for the four situations is shown in Fig.0. We can see from Fig.0 that although symptom-to-event time is longer than hospitalization-to-event time, the difference between the two is small. Generally speaking, the to-event time is longer for death events than for release events. The longest to-event time belongs to a deceased patient who stayed in hospital for 94 days.
We ran three different analyses on the COVID-19 patient data, with distinct goals. The first survival analysis, by Cox regression on time-to-death data, aims to find factors for which the patient dies earlier when the factor's value increases (positive coefficient) or decreases (negative coefficient). The second survival analysis, also by Cox regression, on time-to-release data, aims to find factors for which the patient is released later when the factor's value increases (negative coefficient) or decreases (positive coefficient). The third analysis is the "static" case-control analysis, where patients' risk factors are compared between the deceased group and the recovered group. For binary factors, we use Fisher's test, and for continuous factors, we use the t-test and the Wilcoxon non-parametric test. We have added a logistic regression for both categorical (qualitative) and continuous (quantitative) factors.
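As an illustration of the case-control test for a binary factor, the following is a minimal sketch of a two-sided Fisher exact test on a 2x2 table in plain Python. It is not the code used in this study (R's fisher.test was used); the function name `fisher_exact_two_sided` is hypothetical, and although the row totals (25 deceased, 84 released) follow this cohort, the split by factor is invented for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so that ties with the observed table count.
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Hypothetical counts: rows = deceased / released, columns = factor yes / no.
p_value = fisher_exact_two_sided(15, 10, 20, 64)
print(f"two-sided p-value: {p_value:.3g}")
```

The test uses only the outcome (death vs. release) and the factor, so two patients with the same outcome contribute identically no matter how long they stayed in hospital.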
All our test results (p-values) from the three types of analyses are shown in Table 1. For survival analysis, there are two sets of results, one for onset-of-symptoms to event time and another for hospitalization to event time. We have added a sign in Table 1 to indicate the sign of the Cox coefficient (or whether the hazard ratio is smaller or larger than 1). Table 1 shows that if a factor is significant in the static case-control analysis (we tentatively mark p-values smaller than 0.005 in boldface, following a recommendation in Ioannidis (2018) [20], though another choice is to use a 0.001 threshold [21]), it will most likely also show up as significant in one of the survival analyses. The seeming exceptions to this, monocyte and eosinophil counts, are only due to the specific choice of the p-value threshold: the p-values from the static and temporal analyses are actually comparable.

Risk factors detected by static (case-control) analysis are also risk factors in one of the survival analyses, and vice versa:
On the other hand, if a factor is significant in one of the two survival analyses, it is also (most of the time, depending on the meaning of "significance") significant in the case-control analysis. The seeming exception is hypertension, but again, the p-values in the two situations are comparable.
Opposite direction of a risk factor in two survival analyses: Intuitively, (a higher value of) a factor that makes a patient die earlier should make a surviving patient stay in hospital longer. In other words, the log-hazard-ratios for a given factor should have opposite signs in the two analyses. This is confirmed by the results in Table 1, where the sign attached to a p-value is the sign of the log-hazard-ratio. Whenever there seems to be an exception, e.g., smoking, we found that the p-value does not indicate a statistically significant result. We then expect the hazard ratio to fluctuate around the value of 1, with the sign of the log-hazard-ratio being positive or negative by chance.
Only one of the two survival analyses may produce statistically significant results: By intuition, we would expect the same factor both to significantly make deceased patients die earlier and to significantly make surviving patients be released later. However, according to the results in Table 1, it is more likely that only one survival analysis leads to a statistically significant result, whereas the other does not.
There could be two explanations. The first is the sample size difference between the deceased and surviving samples. There are 84 patients who were released from the hospital, whereas only 25 patients died. As the contribution to the Cox regression fit comes mainly from samples with the event, it is likely that we see more statistically significant results from the time-to-release analysis than from the time-to-death analysis.
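The opposite-sign expectation can be illustrated with a minimal sketch of a one-covariate Cox fit applied twice to the same toy data. This is not the code used in this study (coxph in R was used): the function `cox_univariate` is a hypothetical bare-bones Newton-Raphson fit of the partial likelihood with Breslow handling of ties, and the twelve patients are invented so that the factor accelerates death and delays release. The sketch has no safeguard against a monotone likelihood, which can occur in very small samples.

```python
import math

def cox_univariate(times, status, x, max_iter=50, tol=1e-10):
    """Fit a one-covariate Cox model by Newton-Raphson on the partial
    likelihood (Breslow ties). Returns the estimated log hazard ratio."""
    beta = 0.0
    n = len(times)
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for i in range(n):
            if status[i] != 1:
                continue  # censored subjects enter only through risk sets
            # Risk set: everyone still under observation at this event time.
            risk = [j for j in range(n) if times[j] >= times[i]]
            s0 = sum(math.exp(beta * x[j]) for j in risk)
            s1 = sum(x[j] * math.exp(beta * x[j]) for j in risk)
            s2 = sum(x[j] ** 2 * math.exp(beta * x[j]) for j in risk)
            score += x[i] - s1 / s0              # first derivative
            info += s2 / s0 - (s1 / s0) ** 2     # minus second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Toy data, invented for illustration: days to event, outcome, binary factor.
times = [3, 4, 5, 6, 7, 8, 9, 10, 14, 16, 18, 20]
outcome = ["death", "release", "death", "release", "release", "release",
           "death", "release", "release", "death", "release", "release"]
factor = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1]

# Round 1: death is the event; round 2: release is the event.
beta_death = cox_univariate(times, [int(o == "death") for o in outcome], factor)
beta_release = cox_univariate(times, [int(o == "release") for o in outcome], factor)

print(f"log-hazard-ratio, time-to-death:   {beta_death:+.2f}")
print(f"log-hazard-ratio, time-to-release: {beta_release:+.2f}")
```

In this toy example the coefficient is positive in the time-to-death round and negative in the time-to-release round, which is the pattern of signs described above; with real data, either coefficient can be close to zero and take either sign by chance.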
Another possibility is that a risk factor only makes a deceased patient die faster, or only makes a surviving patient stay in hospital longer, but not both. Fig.1 shows the survival curves for patients with and without a comorbidity (diabetes, hypertension, ischemic heart disease, obstructive pulmonary disease, and any one of them), in the time-to-death and time-to-release survival analyses. For diabetes, the survival curve on the left does not give any visual impression that the comorbidity consistently affects the time-to-death value, and this does not seem to be due to small sample size. On the other hand, for ischemic heart disease, and for having any one of the four diseases, the time-to-death survival analysis does show a trend, and that trend could become significant with a larger sample size.
Table 1. P-values from three types of analyses (time-to-death survival analysis, time-to-release survival analysis, case-control analysis) and 6-7 tests (four Cox regressions, Fisher test for discrete factors, t-test/Wilcoxon test for continuous factors, logistic regression) for various factors for COVID-19 patients. P-values smaller than 0.005 are marked in boldface. The attached sign is defined as follows: for Cox regression, the sign of log(hazard ratio); for the t-test/Wilcoxon test, the sign of mean(death group) - mean(release group); for Fisher's test, the sign of log(odds ratio), positive when the high value is more frequent in the death group; and for logistic regression, the sign of the regression coefficient. The fibrinogen and albumin factors are marked by an asterisk because they have a very high missing data rate (54% and 67%).

Discussion
Since the start of the COVID-19 pandemic in early 2020, there have been many studies investigating possible risk factors for infection, severity of the disease, and mortality. Many of these analyses are straightforward case-control analyses [22], i.e., they compare the outcome of people with and without a certain risk factor value (discrete factors), or by its scale (continuous factors). There are also some survival analyses of time-to-death in severe patients [23][24][25][26][27][28], time-to-release [29], and length of hospital stay [30].
In this paper, we explicitly run three different analyses which address different questions. Case-control analysis asks which factors contribute to a person's outcome; time-to-death survival analysis asks about a factor's contribution to a faster death; and time-to-release survival analysis switches the roles played by risk factors and protective factors, aiming at finding protective factors that reduce the hospital stay before release or risk factors that lengthen it. Since these three analyses address different questions, it is not automatic that they should lead to the same results.
The comparison of results from our three different analyses (Table 1) shows that a factor can be a risk in all three senses, but other factors can be a proven risk in only one or two senses. Our multiple analyses also utilize the dataset more effectively: one may have fewer deceased patients but more cured and released patients; then, by using the time-to-release survival analysis, more information on factors can be extracted.
Although we prefer the use of the cause-specific hazard ratio over sdHR, we still wonder what conclusions the Fine-Gray model leads to. Table 2 shows the p-values from survival analysis with the Fine-Gray model of competing risks for both symptom-to-event and hospitalization-to-event time, for discrete factors only. Comparing Tables 1 and 2, it can be seen that gender remains a weak risk factor, and smoking and health worker status are still not risks. Lack of comorbidities remains protective for released patients. Interestingly, comorbidities are not significant risk factors for deceased patients in the cause-specific analysis, but two out of five become significant in the Fine-Gray model. From Fig.1, we can indeed see slight differences in survival curves between the groups with and without a comorbidity. Summarizing all these observations, we conclude that even if the Fine-Gray approach is used, the conclusion may still be similar to that of the cause-specific HR approach.
An alternative analysis framework is the mixture cure model, where the released patients are considered as cured with a probability [31][32][33]. Several arguments caution against the use of this approach: first, the survival curve of all samples has to be flat at large time-to-event values, but all of the time-to-event values in our data are small, making it difficult to see the plateau; second, our sample size is too small to fit a model with more parameters; third, the deceased samples, whose number is rather small in our data, play a big role in the mixture cure model. Our cause-specific Cox regression can take advantage of the larger number of released patients to find potential risk factors.

Conclusion
In conclusion, we investigated two questions in a COVID-19 survival dataset. One question is whether survival analysis (considering the time to the event) can detect risk factors not detected by case-control analysis (not considering the time to the event). The other question is whether time-to-death survival analysis and time-to-release survival analysis may lead to different lists of risk factors. The tentative answers to these two questions from our sample of n=109 are "no" and "mostly no but sometimes yes". Datasets with larger sample sizes are needed to make a more conclusive statement.

Funding
The authors received no financial support for the research, authorship, and/or publication of this article.

Acknowledgements
Wentian Li thanks Yaning Yang for discussions and acknowledges support from the Robert S Boas Center for Genomics and Human Genetics.

Ethical Approval
Ethical approval was obtained from the Ministry of Health and Tokat Gaziosmanpaşa University, Turkey, and the protocol number was 83116987-278.

Data Availability Statement
The data presented in this study are available on request from the corresponding author.