Identification of influential observations in high-dimensional cancer survival data through the rank product test

Background Survival analysis is a statistical technique widely used in many fields of science, in particular in the medical area, and which studies the time until an event of interest occurs. Outlier detection in this context has gained great importance due to the fact that the identification of long or short-term survivors may lead to the detection of new prognostic factors. However, the results obtained using different outlier detection methods and residuals are seldom the same and are strongly dependent of the specific Cox proportional hazards model selected. In particular, when the inherent data have a high number of covariates, dimensionality reduction becomes a key challenge, usually addressed through regularized optimization, e.g. using Lasso, Ridge or Elastic Net regression. In the case of transcriptomics studies, this is an ubiquitous problem, since each observation has a very high number of associated covariates (genes). Results In order to solve this issue, we propose to use the Rank Product test, a non-parametric technique, as a method to identify discrepant observations independently of the selection method and deviance considered. An example based on the The Cancer Genome Atlas (TCGA) ovarian cancer dataset is presented, where the covariates are patients’ gene expressions. Three sub-models were considered, and, for each one, different outliers were obtained. Additionally, a resampling strategy was conducted to demonstrate the methods’ consistency and robustness. The Rank Product worked as a consensus method to identify observations that can be influential under survival models, thus potential outliers in the high-dimensional space. Conclusions The proposed technique allows us to combine the different results obtained by each sub-model and find which observations are systematically ranked as putative outliers to be explored further from a clinical point of view.

In this context, the Cox proportional hazards regression model [1] is the classical approach to deal with this type of censored data. It is based on a semi-parametric likelihood since the baseline hazard function, h 0 (t), is not specified, which contributes to its flexibility. Although the Cox regression model is a widely used method due to its simplicity, the corresponding estimator has a breakdown point of 1/n [2], which means that the presence of outlying observations may have extreme influence on the estimation of the model parameters. In order to handle this problem, a robust version of the Cox regression model has also been proposed [3].
The robust version of the Cox regression model [3] is based on doubly weighting the partial likelihood function of the Cox regression model. The robust Cox is an alternative method to the Cox regression model estimation, as a framework that allows to infer the parameters in a more robust way when outlying observations are present, i.e., individuals that lived too long or died too early when compared to others with the same clinical conditions. Furthermore, the weights obtained with this method can give information about which observations are more influential and therefore can be considered as putative outliers [4].
The detection of outliers in survival data has gained great importance due to the fact that the identification of individuals with survival time too high or too short can lead in the medical field to the detection of new prognostic factors [5]. The first attempts to analyze and to identify outliers were based on residuals. In this context, graphical methods based on the analysis of martingale, score and deviance residuals were proposed [6], and also other contributions including the log-odds and normal deviate residuals [5].
One of the challenges arising when dealing with patient' omics data is the highdimensionality problem. In this type of data, the number of covariates (p) is often much larger than the number of observations (n), i.e., p n. In this context, the usual statistical techniques for the estimation of the parameters cannot be applied, due to the inherent ill-posed inverse problem [7].
When dealing with thousands of covariates, as is the case for omics data, dimensionality reduction is a crucial initial step, leading to distinct models depending on the variable selection method used.
In this context, regularized optimization techniques are widely applied, which include the least absolute shrinkage and selection operator (Lasso) [8], Ridge and Elastic Net regularization [9]. The Lasso, uses an l 1 -norm regularizer, and the Elastic Net uses a linear combination of l 1 and l 2 penalties. In contrast with the Elastic net, in the presence of highly correlated variables, the Lasso tends to arbitrarily select one of them.
In this sense, depending on the methodology used to reduce the dimensionality of the data, different models are obtained and, consequently, distinct outliers are identified. The aim of this work is, therefore, given a high-dimensional dataset, to find outliers (or influential observations) from different sub-models, which are obtained from distinct techniques for variable selection. The method proposed is based on the Rank Product (RP) test, a non-parametric method, to identify the outliers that are consistently highly ranked in each of the sub-models. The ovarian cancer dataset, with gene expressions as covariates, was chosen to illustrate the applicability of the proposed method. Three gene expression sub-models are presented, and the RP test is applied as a consensus or ensemble test that combines the results obtained by each model, often distinct and sometimes contradictory. Notice that each sub-model has different baselines, since for this particular dataset there is no groundtruth to start from.
Although the rank product and the deviances measures for survival models were already proposed previously in different contexts, the combination of RP-based statistical tests as a means of conferring robustness to outlier detection tasks represents the main novelty of this work.
The outline of this work is as follows. In "Methods" section, the martingale residual used to detect outliers in survival analysis and the Rank Product test are explained in detail. In "Results" section the results concerning an application example are presented. Finally, Conclusions are addressed in "Conclusions" section.

Methods
The method proposed to obtain potential outliers considering different sub-models, is the Rank Product (RP) test. Before explaining this technique in detail, we need to select the measure used to obtain outliers in survival analysis.
There are in the literature a vast number of ways to identify abnormal (outlying) observations in survival analysis. The most common technique is based on the residuals, as referred before. More recent studies proposed other algorithms based on quantile regression [10] and the concordance c-index [11]. In the present work the focus will be given to the martingale residual but it is worth mentioning that the proposed method can be applied to any other deviance measures, as long as a final outlyingness ranking can be obtained.
The Martingale residuals arise from a linear transform of the Cox-Snell residuals [6] and are very useful for outlier detection for censored data.
Let all the covariates be fixed, the martingale residual for the i th individual is given bŷ where β = (β 1 , . . . , β p ) are the unknown regression coefficients, which represent the covariate effect in the survival,Ĥ 0 (t i ) represents the estimate of the cumulative baseline hazard, x i = (x i1 , . . . , x ip ) is the covariate vector associated with the i th individual and δ i is the censored function. These residuals are asymmetric and take values in (−∞, 1). The martingale residuals are the difference between the observed number of the events for the i th individual in (0, t i ) and the corresponding expected number, obtained by the adjusted model. The observed number of 'deaths' is one if t i is not censored, i.e., is equal to δ i . On the other hand, r i is the estimate of H(t i ), which can be interpret as the expected number of 'deaths' in (0, t i ), since it is only considered an individual. This residuals will reveal the individuals that are not well adjusted to the model. i.e., those that lived too long (large negative values) or died too soon (values near one), when compared to other individuals with the same covariate pattern.

Rank product (RP)
When dealing with high dimensional datasets, dimensionality reduction is warranted. Regularization methods are an example on how to overcome this challenge, as referred to before. However, different technique result in different estimated sub-models, which will significantly influence the obtained results regarding the identification of outlying cases.
In order to address this challenge, we propose a method that can combine all the results obtained for each one of the different sub-models. The rationale is that, if a given observation is systematically classified as an outlier, independently of the chosen sub-model, then our trust on the accuracy of that particular classification should increase. To accomplish this goal, the RP test is used.
From the theoretical point of view, the RP test is a non-parametric statistical technique which gained great importance in detecting deferentially regulated genes in replicated microarray experiments [12] and allowing the meta-analysis of independent studies [13].
The required input is a list of all the observations ranked by their level of outlyingness, based on one of the described methods for outlier detection. The backbone of this method is to allow the statistical assessment of a consensus rankings obtained in distinct sub-models, thus providing a combined identification of observations consistently ranked higher.
From the conceptual point of view, let n be the number of observations and k the number of different sub-models where the outlier detection method was performed. Consider that Z ij is a measure of the deviance (or outlyingness) of the i th observation in the j th submodel, with 1 ≤ i ≤ n and 1 ≤ j ≤ k. The deviance rank for each Z ij considering method j is defined by For each sub-model, the lowest ranks imply that the observation is more outlier that the others. After obtaining the ranks for each sub-model, the rank product is performed, Several methods were proposed in order to estimate the statistical significance of RP i under the null hypothesis of random (uniform) rankings. In [12] the distribution of RP i was based on a permutation approach. An alternative formulation that is less computational intensive was described more recently, based on an approximation of the logarithm of those values using the gamma distribution with parameters (k, 1) [14]. In [15] the exact probability distribution for the rank product was derived. The one chosen in the present study is based on the geometric mean of upper and lower bounds, defined recursively [16], since the algorithm provides accurate approximate p-values for the rank product when compared to the exact ones and is substantially faster in terms of computational execution.
Another key issue when performing these tests is related with the multiple testing problem. In fact, since many observations are tested, type-I errors (false positives) will increase. Several correction methods exist that usually adjust α so that the probability of observing at least one significant result due to chance remains below a desired significance level. The Bonferroni correction is one classical choice, with less conservative options also available, such as the False Discovery Rate (FDR) [17].
The FDR, which is the expected proportion of false positives among all tests that are significant, sorts in an ascendant order the p-values and divides them by their percentile rank. The measure used to determine the FDR is the q-value. For the p-value: 0.05 implies that 5% of all tests will result in false positives, instead, for the q-value: 0.05 implies that 5% of significant tests will result in false positives. The q-value is therefore able to control the number of false discoveries in those tests. For this reason it has the ability of finding truly significant results.
In this context, the RP is used as a consensus technique for all different results obtained by each sub-model. In order to illustrate this approach, the RP technique is applied to three sub-models, where the goal is to obtain outlying observations based on the martingale residuals, independently of the estimated sub-model. In order to evaluate the dependency of the results to the particular choice of the sub-models, a resampling strategy was also conducted.

Results
To evaluate the proposed consensus outlier detection method, the described procedure was applied to a high-dimensional dataset constituted by ovarian cancer patients microarray expression data.
For the analysis, this dataset was aggregated by the TCGA consortium allowing for the analysis to be reproducible with the original dataset. The clinical data was cleaned using "Days to last follow-up" and "Days to death" attributes to detect inconsistencies between them. Only the cases where the number of days matched were included in the analysis. The same process was performed for the attributes "Days to death" and "Vital status", where some cases had as status "deceased", but a missing "Days to death".
This dataset was analyzed in three different ways. In the first analysis the following regularization methods were performed [18]: 1) Lasso, 2) Lasso and elastic net, leading to two different sets of selected genes. The union of these sets was then considered, allowing to reduce the dimensionality from 12,042 to 109 covariates (genes). After this, a stepwise algorithm using the AIC (Akaike information criterion) was applied and 63 covariates were thus obtained. In the second analysis, 18 genes were considered, based on those selected in a previous study [19]. Finally, a third approach is presented where 22 genes were selected based on their reported association with ovarian cancer, as in the Genetics Home Reference https://ghr.nlm.nih.gov/condition/ovarian-cancer#genes. The list included also gene RAD51D which is not present in the original TCGA data and was therefore discarded from the analysis. Notice that for the three analysis considered there is no overlap of the covariates selected.
It is noteworthy that, although we have pursued these three analyses, we can indeed include many others, for example, using different feature selection methods or prior clinical information.
To overcome the fact that the results obtained for each of the analysis are model-based, a sampling strategy was also implemented in order to determine whether resampling the data using a sub-model of covariates (genes) would recognize the outliers previously identified. The resampling algorithm randomly picked 1000 genes (without replacement) from the ovarian cancer dataset. The Cox regression model with elastic-net regularization was then fitted (using glmnet), resulting in a reduced set of selected genes. In order to calculate the corresponding martingale residuals, a Cox regression is then performed on this reduced gene set (using coxph). The resulting residuals allow to sort the observations accordingly to their outlyingness level. This procedure is repeated 100 times, resulting in 100 models to feed the RP test.
All the analysis were performed in R [20] and are fully documented in the "Rmd File" as R Mardown files to allow full reproducibility. The libraries used for the analysis were: survival, for the Cox regression model to obtain the martingale residuals, and qvalue, to determine the q-values. The two robust versions of the Cox regression model were the coxrobust, and an improvement of this method available in [4]. The algorithm implementation to obtain the p-values for the rank product, based on the geometric mean, is provided by Heskes and colleagues [16].
The proportional hazard assumption [21] for the Cox's regression model was tested, and the results showed that this hypothesis was never violated. The p-values for each of the sub-model presented are the following: 0.1932 (63 genes), 0.3795 (18 genes) and 0.3868 (22 genes).
The majority of gene expression do not have a normal distribution (see Supplementary files for the Shapiro tests conducted) although this fact does not affect the resulting Cox models' validity.
In the next sections the results for the martingale residual, for each one of the models, and the RP that combine all the ranks, for each sub-model considered, is presented.

TCGA ovarian cancer -63 genes
For this particular model, the dataset can be represented as a matrix of size 517 × 63. The Cox's regression and the Cox's robust regression models were performed. The following 21 genes were significant for a 5% level of significance in all the methods considered: HPCA, RPS6KA2, GRB7, ABCD2, WDR76, NDUFA3, PI3, BNC1, D4S234E, CSNK1G1, SSTR1, PSG3, GAS1, POPDC2, DAP, SRY, HOXD11, HSPA1L, PPP3CA, MPZ and LBP. Also 11 genes for the Cox proportional hazard and 13 genes for the Cox robust, were not significant, for a 5% level of significance. Those differences are regarding genes: SDF2L1, PRR16, ALG8 and ELA3A. Genes SDF2L1, PRR16 and ALG8 were not significant in the Cox robust and significant in the Cox regression model and gene ELA3A was significant in the robust case ( [4], proposal) and not significant for the classical Cox. For more details, see Table 1. Figure 1 shows that observations 39 and 350 are identified as influential observations in the sense that they have the lowest weights. The results regarding the residuals are shown in Fig. 2. Again observations 39 and 350 in the martingale residuals appears to have the lowest values when compared to all the others.

TCGA ovarian cancer -18 genes
The features selected were based on the work of [19] where the authors considered as covariates of the Cox model the expression of 18 genes. The dataset is a matrix of size 517×18, and, in this case, the only genes statistically significant were: CRYAB and SPARC, for Cox's and Cox's robust (see Table 2).
The CRYAB gene codes for the crystallin alpha B chain, a protein that acts as a molecular chaperone. Its function is to bind misfolded proteins and, interestingly, some defects associated to this protein and gene have already been associated with cancer, among other diseases. In particular, a recent study [22] analyzed which molecular factors could affect ovarian cancer cell apoptosis and the authors found out that there was a statistical  This protein has, indeed, a negative regulation of tumor necrosis, which may explain these results.
The SPARC gene codes for Secreted protein acidic and rich in cysteine, a protein that appears to be a regulator of cell growth, by interaction with cytokines, the extracellular matrix and also binding calcium, copper, and several others biochemical compounds. This protein is overexpressed in ovarian cancer tissues [23], playing a central role in growth, apoptosis and metastasis. It also has been identified as a candidate therapeutic target [24]. Figure 1 shows that observations 113 and 219 are identified as influential observations (lowest weights). However, for this example, the weights are not so distinct in the sample. The results regarding the residuals are shown in Fig. 2. Observation 219 in the martingale residuals has the lowest value when compared to the all the others.

TCGA ovarian cancer -22 genes
The dataset considered is a matrix of size 517 × 22, where the number of columns corresponds to the genes that are most associated with the ovarian cancer. Interestingly, only a b c Fig. 2 Plot of the martingale residuals for the TCGA ovarian cancer data for each one of the sub-models. a 63 genes expression, b 18 genes expression, c 22 genes expression two genes in this dataset are statistically significant: BRCA2 in the Cox model, and PALB2 when considering both Cox model and its Heritier's robust version [4] (see Table 3). Both BRCA2 and PALB2 genes encodes a protein that may function in tumor suppression (for more details see https://ghr.nlm.nih.gov/gene/. In the BRCA2 this protein is to help repair damaged DNA ensuring the stability of the cell's genetic material. If the BRCA2 gene is mutated/changed the DNA could be corrupted developing genetic alterations that can lead to cancer. In [25] is conducted a study where BRCA1 and BRCA2 genes mutations account for the majority of hereditary ovarian carcinomas. On the other hand the PALB2 is related to breast cancer. Recent studies [26] showed that women who carry mutations in the PALB2 gene are at similar breast cancer risks as those who carry mutations in BRCA2. When using the weights of the robust version, 114 is identified as an influential observation (Fig. 1). Figure 2 shows the results concerning the residuals. Observations 114 and 211 in the martingale residuals have the lowest values when compared to all the others.
To overcome the fact that, for each sub-model, different outliers are obtained, the RP test was performed. The results are presented in the next section.

Rank Product results
The ranks of the martingale residuals for each sub-model were determined. The product of the ranks was obtained, and, finally, the p-values and corresponding q-values were calculated, as shown in Table 4. Based on those results, and considering a 5% level of significance, the observations that are considered outliers based on the three different sub-models are: 55, 114, 211, 219 and 455. Notice that three of the observations considered as outliers in the RP test had low values for the martingale residual. Observation 219 for the model with 18 genes, and observations 114 and 211 for the model that considered 22 genes.
The overall values of the survival time are between 8 to 5481 days, with the first, second and third quantile: 376, 923 and 1483, respectively. Only approximately 3% of the observations had a survival time higher than 2500 days. Regarding observations 114, 211 and 219 the survival time is, respectively, 2780, 3953 and 3525 (maximum was 5481 days), all censored, see Table 4. In this way the observations identified are long-term survivors.
To illustrate the robustness of the RP test, a resampling technique was performed as described above. The results displayed in Table 5 show that the observations considered outliers for the three different sub-models are also outlying observations for the 100 different models obtained. This includes all the observations considered outliers in Table 4. Indeed, there are individuals that consistently appear with larger residuals, irrespectively of the model. It is noteworthy that, although the genes selected in each model are different, there is a set of patients that always exhibit discrepant values for their survivals, as would be predicted by their covariates. This illustrates the robustness of the method to a particular choice of the model.
These results show that the proposed method was able to combine in a statistically solid way the results of different estimated models. In particular, the application of the RP test allowed to identify a consensual list of putative outliers in the dataset in a semi-automatic way, paving the way for the analysis of other datasets where discrepant observations are a critical issue in clinical applications.