On Fair Performance Comparison between Random Survival Forest and Cox Regression: An Example of Colorectal Cancer Study

Random Forest (RF), a mostly model-free and robust machine learning method, has been successfully applied to right-censored survival data, under the name of Random Survival Forest (RSF). However, RF/RSF has its distinct strategies in classification and prediction. First, it is an ensemble classifier and its performance is an average of multiple rounds of data fitting. Second, the training set is a bootstrap (sampling with replacement) generated set with repeated used of roughly 2/3 of all samples and testing set consists of those not used (out of bag samples). Both features are not intrinsic to Cox regression or other single classifiers. Not considering these two features could potentially lead to a partial comparison between the performance of the two methods. By using a colorectal survival dataset, we illustrate the problems of using k-fold cross-validation, using only one resampling without an ensemble average, and using the whole dataset for both fitting and testing, in Cox regression, when comparing with RSF. We provide a more accessible R code for simple calculation of discordance index (D-index) and unweighted integrated Brier score (IBS) for Cox regression, and unweighted IBS for RSF.


Introduction
In cancer epidemiology studies, one of the most commonly used analyses is Cox regression, which regresses the right-censored time-to-death data on risk factors. In recently years, new methodologies are introduced to survival analysis. An all-embracing name "machine learning" covers a whole spectra of these new techniques [1], with most of them "model free" making less assumption about the data. method leads people to use the default parameter values, and unaware that the performance of RSF may change with the parameter value [5][6][7], as well as not able to compare its performance fairly with standard approaches. The latter problem is partly due to the fact that RSF is an ensemble classifier (data fitter), not the one-time-only run of Cox regression. Performance comparison between the two are often not carried out appropriately. Moreover, RSF has a particular way in choosing training dataset (sample with replacement, or bootstrap) and testing dataset (samples not chosen by the bootstrap).
If the performance of Cox regression is measured from a testing set which is not equivalent to that in RSF, then the estimated performances of RSF and Cox regression are not equivalent. Besides the extremely unfair choice of using the whole dataset as both training and validation, the seemingly fair k-fold cross-validation for Cox regression is also not appropriate, in two different ways. First, it is not equivalent to RSF in that each sample is only used once in training whereas in the bootstrap set, a sample can be used more than once. Second, the typical value for is 10, meaning the performance is averaged over 10 times, whereas the number of runs in RSF ensemble learning is of the order of hundreds or thousands.
Many good practice known by data analysts may not been made explicit or emphasized enough in the literature. We share our experience in our comparison of Cox regression and RSF on a colorectal cancer survival data. Although there are excellent software packages professionally written, we still found it difficult to find an error comparison program that fits our need. Therefore, we share our simple but hopefully accessible code in public domain.
The dependent variable is the right censored time to death in the unit of days. The status = 1 if the i'th person is dead at diagonosis-to-death time , and = 0 if the i'th person is unavailable (censored) at time after diagnosis . The mean diagonosis-to-death time for patients who die is 617 days (1.69 years), whereas the mean time for censored patients is 1128 days (3.09 years). Because the distribution of the time is not normal, mean may not be the best characterization of the distribution. For example, the median, geometric mean, mode, of diagonosis-to-death time are 401, 280, ∼500 days (or 1.1, 0.77, 1.4 years). Thoese for censored patients are 943, 769, 1778 days (or 2.58, 2.1, 4.9 years).
There are other information which is not available for all samples, including: cancer stage information (available on 80% the samples: 9,28,117,24 persons at stage=1,2,3,4), metastasis status (available on 64% of the samples: 61 persons whose cancer is spread, 81 are not), and treatment information (on 111 adjuvan and 49 neoadjuvan). Either because these variable can be too strong predictor of the survival time or because there are too much missing data in them, they are not used as the predicting factors.

Random Survival Forest
We re-introduce Random Survival Forest in more detail because of its importance in understanding the points we are making in this paper. We have the right-censored data where where the dependent variable being ( , ) ( = 1,2, ⋯ is the index for persons), is the status (1 if dead, 0 if unknown at the end time of data collection), and is either the time to the = 1 event or the end time of data collection, for = 0 samples. The independent variable is = ( 1 , 2 , ⋯ ⋯ ), where = 1,2, ⋯ is the index for variables.

1.
A randomly-sampling-with-replacement (bootstrap) dataset is produced. This dataset contains items (same number as the original dataset), but two or more items can be identical because of the "with replacement" requirement. In general, 1 − −1 = 0.632 = 63.2% independent samples from the original dataset are chosen.
2. These items are used to produce a decision tree, i.e., a series of node at which samples are classified (split into two branches) according to some independent variables, so that there is a maximum discriminative power between the two branches: (a) At each node, a randomly selected 0 ≤ independent variables are used to split the samples; (b) Each branch should contain at least an average of samples; (c) Besides #(b), there are other stopping criteria if the discriminative power between the two branches is lower than a threshold.
3. The cumulative hazard function for each person is estimated from the terminal nodes.
4. At each iteration, the error is calculated on those samples that are not chosen by the bootstrap (roughly −1 = 0.368 = 36.8% of the samples) -called out of bag (OOB) samples.

5.
Steps #1 -#4 are iterated times, and the overall error of RSF is an average of errors calculated above.
The first lesson from the above re-introduction is that RSF is an ensemble learner/classifier/data-fitter, each iteration contributes to the overall performance, and the overall error is an average of that in all iterations. Second lesson is that RSF's strategy in training and validation samples partition is different from that in k-fold crossvalidation. In bootstrapped training set, some samples are used multiple times, whereas in k-fold cross-validation, one training sample is used only once. The third point to note is that there are several parameters controlling the tree construction process, and it is taken as granted that their default setting lead to the optimal performance. Let us review parameters described above and their correspondence in the function rfsrc in the R package randomForestSRC: number of rounds (number of trees), ( ntree in rfsrc, with default value of 1000; minimum (after averaging over all nodes) number of samples per node before stopping, ( nodesize in rfsrc, with the deault value of 15).

Programs Used
All survival analyses are carried out by two R statistical packages ( www.r-project.org), including survival [8], randomForestSRC [4]. Some analyses in the Discussion section use the R packages pec [9] and ipred [10].

Measure of Error: Integrated Brier Score
For binary outcomes, the Brier score [11] (at given time point) is simply the mean square difference between the predicted survival probability ( ) and the actual survival situation at that time, if known (index for samples: = 1,2, ⋯ ): Note that when time passed the censored time , we do not know whether the person survives or not, thus the prediction of the survival of that sample can not be checked (shown as NA -not available). The integrated Brier score can be defined as an average of BS( ) at all available time points in the dataset, ( = 1,2, ⋯ , ), and the integral is approximated by a summation: Equation 2 is the "unweighted" IBS: we do not use the weighted IBS because we want to make the calculation as simple as possible.
For randomForestSRC R package [4], the object produced by rfsrc function contains the predicted survival probability for both the OOB samples (rfsrc()$survival.oob) and the "in the bag" training samples (rfsrc()$survival), in a single array of length × (one sample of length followed by the second sample, etc). The available time points { } can be obtained by rfsrc()$time.interest The survfit function from survival R package [8] can be used to obtain the predicted survival probability. The surfit(coxph(), newdata=...)$surv produces a matrix with n-row ( = number of samples) and -column ( = number of time points). The surfit(coxph(), newdata=...)$time is the list of observed time points. Using the predicted survival probability from the output survfit/coxph (from the survival package), we wrote our own R function to calculate IBS for Cox regression [12]. The R function and data that support the findings of this study are openly available in github at github.com/wlicol/coxrsf.

Measure of Error: D-index or One Minus the Concordance Index (C-index)
Concordance index used in survival data is defined as the proportion of correct prediction on survival by the model in all pair of samples. It can also be called Harrell's C-index named after the author of the measurement [13]. More specifically, for any two samples where at least one of them has status=1 (e.g. dead), if the survival time for the person whose model-predicted risk for death is higher, then the sample pair is said to be concordant. Percentage of concordant pairs is the C-index. One minus the C-index, or proportion of discordance pairs, is called C-error by Ishwaran et al. (2008) [4], can also be called D-index for discordance rate.

Parameter Tuning and Setting in RSF
Following the discussions in (e.g.) [5], we examined the impact of the number of trees ( , ntree in rfsrc of the R package randomForestSRC), the minimum number of samples (on average) per node before stopping the growth of the tree , nodesize in rfsrc of the R package randomForestSRC), and number of variables used to split a tree 0 (or mtry in rfsrc). We did not see obvious impact of mtry on performance probably because the number of independent variables (seven) is too small [7]. The impact of ntree and nodesize on RSF performance is shown in Figure 1.  shows OOB D-index and IBS as a function of ntree in our colorectal survival data with 7 independent variables. At the default value of ntree=1000, the variation of the error calculation seems to be minimum. At 500 trees, the variation is about the same level. However, when the number of tree is too low (e.g. < 100), we may either get a very small error or a very large one for D-index, and larger error for IBS, by chance. is increased. A larger nodesize means a lesser grown tree, and it might prevent tree overfitting. However, if nodesize as a percentage of the total sample size is too large, it may have a chance to fit the data. With the guidence from Figure 1, we chose ntree=500 and nodesize=15.
Although traditionally, the training set selected is by bootstrap (subsampling with replacement), the default setting in rfsrc of randomForestSRC is samptype = "swor" or sampling without replacement. We changed the setting to swr (sampling with replacement) to be consistent with the original literature. Our conclusion is not affected by using samptype = "swor", which is recommended by some papers because of the availibility of theoretical results [5], but the Cox regression fitting will need to be carried out on the sampling without replacement training set also.

Always Compare Performance on Equivalent Sets
We run RSF 100 times on the colorectal dataset with 7 independent variables and calculate IBS for both OOB samples and IB (in-bag) samples, both directly provided by the rfsrc function (see Methods). The number of trees is fixed at 500, and nodesize value is randomly selected from (10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20). Figure 2(A) shows the sorted OOB IBS (black) and IB IBS (blue) from small to large. It is well known that OOB errors is larger than IB errors.

.632-weight (orange), whole dataset as both training and testing (red), unique IB (dark blue); (C) Cox regression: 3-fold CV(red), 10-fold CV(blue); (D) Cox regression: single classifier OOB (blue). The Cox OOB IBS in (B) is reproduced in (C) and (D) for comparison.
Similarly, we run Cox regression 100 × 500 times, where in each group of 500 runs, a random bootstrap set is used for fitting the regression, which is then applied to the OOB samples to calculate IBS (using our custom R code, see Methods); these 500 IBSs are averaged to get one IBS similar to that for RSF. Again, Figure 2(B) shows that IB (blue) has lower errors than OOB (black).
It was suggested Efron (1983) [14] that the testing error rates can be weighted down by a factor of 1 − −1 = 0.632, with the rest (weight = −1 = 0.368) replaced by training error rates. These weighted down errors when it is done on the forest level for RSF and on individual run level (equivalent to tree level) for Cox regression are shown in Figure 2(A) and (B) (orange). If we compare the IBS between RSF and Cox regression, that on OOB set in RSF should be compared to that on OOB set in Cox regression. Similarly, IB should be compared to IB, and 0.632weighted error should be compared to another 0.632-weighted error.
Two more types of error are shown in Figure 2(B) for Cox regression. One is to use all samples as both training and testing set (red). As there is no random factor involved, there is only one value. The red line is simply a repetition of the single value 10 times. Another type (we called it "unique IB") is to use each person/sample, if they are sampled multiple times in IB due to sample-with-replacement procedure, only once towards the error calculation (dark blue). Both are similar to IB error, though slightly higher.

Performance Determined by OOB Samples and by k-fold Cross Validation may not be the Same for Cox Regression
Most people use a k-fold cross-validation to evaluate the performance of Cox regression. In k-fold cross-validation, the whole dataset is partitioned into k groups: rotating each k-1 groups as training then applied to the kth group for error calculation, the final error is the average of these k errors. The = 3 choice leads to a situation similar to RSF in that 66% of the samples are used for training whereas 33% of the samples for validation. However, there are still two differences: one being that the training set in RSF contains examples instead of ( − 1) / for k-fold validation, another being that the final error in RSF is an average of ( ntree) runs and that in k-fold validation in Cox regression is an average of runs; and typically ≪ .
Figure 2(C) shows (sorted from small to large) IBS of 100 runs for 3-fold validation and 10-fold validation. Both tend to have a lower error than those calculated from OOB samples (average of the 100 runs: 19.86% for OOB, 19.58% for 3-fold CV, 19.28% for 10-fold CV). There have been several publications addressing related issues, though not necessarily for survival data [15,16]. Whether the choice of OOB overestimates the error or choice of kfold cross-validation underestimates the error, the message from Figure 2(C) is that when two models are compared, the same validation data selection scheme should be used.

Only the Ensemble Classifier Version of Cox Regression should be used in a Fair Comparison, not the Single Classifier Version
Since RSF is an ensemble classifier whereas Cox regression can be considered as a single classifier [17][18][19], our proposal for a fair comparison between the two is to convert Cox regression to an ensemble classifier. It is done by repeated random sampling subset and using the OOB samples for error calculation, then averaging errors from these individual runs.
What if Cox regression remains to be a single classifier? In other words, what if the repeated bootstrap, OOB error calculation is reduced to only one run? Figure 2(D) shows the OOB error for Cox regression without multiple runs. It is clear that without the averaging of multiple (e.g. 500 times) runs, the errors fluctuate widely (standard deviation (sd) for ensemble classifier version of the Cox regression: 0.001, whereas the sd for the single classifier: 0.024). If one use the single classifier version of Cox regression to compare with with RSF, sometimes the conclusion can be Cox regression outperforms RSF, and other times RSF outperforms Cox regression.

Similar Conclusions using D-index
The conclusions from Figure 3 can be similarly reached by the error calculation of D-index (see Method section). Figure 3(A) shows that the D-index for Cox regression with OOB, IB, unique-IB, as well as using the whole dataset as both training and testing, and D-index for RSF using OOB as testing set. All Cox regression errors as measured by Dindex using some samples in both training and testing set are lower than RSF-OOB, but those using OOB have higher errors. A fair comparison would lead to the conclusion that RSF performs better than Cox regression. Similar to IBS, D-index also shows that k-fold cross-validation for Cox regression leads to lower error than using OOB as testing set. The 10-fold CV has, on average, lower D-index error than 3-fold CV. The conclusion based on a fair comparison, i.e., using OOB error in both RSF and Cox regression, clearly favors RSF over Cox. However, when 10-fold CV error for Cox regression is used to compare the OOB error for RSF, the two two D-index errors are very close.
Finally, same as IBS, the D-index calculated by one-time OOB set for Cox fluctuates wildly from run to run. Single classifier has higher variance for error rate than ensemble classifiers, including both (e.g. 500-time) OOB and 10-fold CV, 3-fold CV.

Discussion
Random survival forest has several advantages over traditional survival analysis method, such as being unlikely to overfit the data with the complete set of independent variables, its easiness in judging the importance of variables for the purpose of variable selection, and its ability to incorporate nonlinear, interactive roles of multiple variables. These are not the topic addressed in the paper.
The motivation of this paper is an observation we have made on existing literature: that when RSF is compared to Cox regression, many papers conclude that the two have similar performance [20,21], while detail concerning the error calculation (on the Cox regression side) is not provided [22][23][24][25][26]. Some comparisons are even hard to judge because k-fold CV is introduced to RSF [27], or models with different number of independent variables are compared [28]. Steele et al. (2018) and Zhang et al. (2019) [29,30], k-fold CV is used in evaluating other models including Cox regression, which may be another example of unfair comparison.
On the other hand, when the performance is fairly compared, i.e., when the error is calculated over OOB ensemble for Cox regression, RSF usually is the winner over Cox regression [31,32]. The practice in Appendix C of Myte (2013) [33] is almost fair except the Cox regression is treated as a single predictor instead of an ensemble classifier. Unfortunately, we found more unfair or potentially unfair comparisons of error in the literature than the correct ones, with the latter more likely carried out in thesis, manual pages, or computer scientists and statisticians than applied researchers. This justifies the need to highlight this issue, i.e., the test set where the error rate or performance is calculated, should be equivalent when comparing two different data analysis techniques.
Although there is a professional written R packages pec [9] for evaluating IBS error in different situations, the aim of the package is to be as generic and wide as possible. To make the program to work as desired and to understand most detail may require a steep learning curve for practitioners whose main expertise is in a non-mathematical/non-statistical field. We try to fill this void by providing a simply written R codes for calculating only IBS/RSF (OOB), IBS/Cox (new samples), and D-index/Cox (new samples). The D-index/RSF (OOB) is the default output of error rate in randomForestSRC package). The R code is in github.com/wlicol/coxrsf.
Our codes only require a list of predicted survival probability of each person at sequence of time of interest [34]. For RSF, we use the survival.oob part of a rfsrc object to obtain the survival probability. For Cox regression, we use the survfit function from the survival R package (the surv part) to obtain the survivial probability.
To double check our program, Figure 4 shows a comparison of Brier score, estimated by using the fitted RSF on the whole dataset itself (not OOB), as a function of time, between our code and two other programs, pec and sbrier/ipred. We found that (1) the pec and sbrier/ipred results are very similar, as they both use the inverse probability of censoring (IPC) weights [3,4]. (2) Our simple code without using any weights lead to very similar result as the weighted BS's. (3) Our program has an extra option to print out BS(t) contributed from deceased samples (status=1) and cencored ones (status=0) separately. Figure 4 shows a huge difference between the two.
It was pointed out that when the independent variables contain both discrete and continuous variables, continuous variables may have an advantage to be chosen more often as the variable to split trees in RF (not RSF specific) [5]. This may rank some continuous variable ahead of categorical variables, and if the categorical are actually more important, underestimate the classifier performance [35]. This issue does not affect the analysis in this paper because our categorial variables (cancer type and gender) are not actually important. Also, this bias potentially underestimates the performance of RSF, so correcting the bias would make RSF even better than Cox regression.

Conclusion
In conclusion, in the current literature of application of random survival forest to real data, the majority of them may not compare the performance of RSF and Cox regression fairly, when the error (D-index or IBS) is not calculated on OOB samples, and/or without enough number of repeated trainings/testings. The k-fold CV may increase the performance of Cox regression over OOB, and if k is too small, the error rate may vary from run-to-run. We provide simple R code to aid the practive of fair comparison between RSF and Cox regression.

Funding
The authors received no financial support for the research, authorship, and/or publication of this article.

Acknowledgements
Wentian Li acknowledges the support from the Robert Boas Center for Genomics and Human Genetics.

Ethical Approval
Ethical approval was obtained from the Hatay Mustafa Kemal University, Faculty of Medicine on July 6, 2020 with the protocol number 32/28.

Data Availability Statement
The data presented in this study are available on request from the corresponding author.