Estimation of Right-Censored Data with Partially Linear Models: A Comparison for Different Censorship Solution Methods

With the rapid growth in collected information, datasets now commonly include missing or censored values. This makes it hard to model and analyze datasets accurately, especially in medical and clinical studies. In this paper, three different solution methods are introduced to handle right-censored data, which is the most common kind of censored data. To see their effects on the modelling process, a semiparametric regression model based on smoothing splines is then used.


Introduction
In biomedical applications, mainly in clinical trials, two important issues arise when studying time-to-event data. We will assume the event to be death, although it can be any event of interest.
1. Some individuals are still alive at the end of the study or analysis, so the event of interest, namely death, has not occurred. Therefore, we have right-censored data.
2. Length of follow-up varies due to staggered entry, so we cannot observe the event for individuals with insufficient follow-up time.
Right-censored data is a common phenomenon that emerges in various applied fields. In the statistics literature, this kind of data is encountered especially in medical studies. Censored observations are one of the most important problems that degrade data quality. In practice, datasets commonly contain missing or censored observations for many different reasons, such as the abrupt death of patients, withdrawal from the study, equipment malfunctions and so on.
In terms of regression analysis, classical methodology is inapplicable to such incomplete data. In analyzing survival data, there are three common approaches to overcome censorship, which are Kaplan-Meier weights, synthetic data transformation and kNN imputation [1][2][3][4][5][6][7][8]. The imputation of microarray data has also been studied. In the literature, the three methods mentioned are always studied separately. Qualitative comparisons have been made between the weighted log-rank statistics and weighted Kaplan-Meier (WKM) statistics. In that work, a statement of null asymptotic distribution theory is given, and the choice of weight function is discussed in detail. Small-sample simulation studies indicate that the WKM statistics compare favorably with the log-rank procedure even under the proportional hazards alternative and can perform better than it under the crossing hazards alternative [9][10][11].
The aim is to compare the mentioned methods and to bring out their important differences and properties when data is modelled by a semiparametric regression method based on smoothing splines. In addition, an interactive web application is presented to simplify modelling survival data with semiparametric models. In statistics, semiparametric regression includes regression models that combine nonparametric and parametric models. They are mostly used in situations where the fully nonparametric model may not perform well, or when the researcher wants to use a parametric model but the functional form with respect to a subset of the regressors or the density of the errors is unknown.

Smoothing splines
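A semiparametric regression model can be written as follows for the uncensored observations:
$$Y_i = X_i^{\top}\beta + f(t_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$
where the $Y_i$'s are the values of the response variable, $X_i$ and $t_i$ are the explanatory variables of the parametric and nonparametric components, $\beta$ is the vector of regression coefficients, $f$ is an unknown smooth function and the $\varepsilon_i$'s are random error terms. In this paper, our interest is in estimating model (1) when $Y_i$ is observed incompletely and right-censored by a random censoring variable $C_i$. It should be noted that, throughout this paper, the explanatory variables $(X_i, t_i)$ are assumed to be completely observed. Consequently, the incompletely observed variables $(Y_i, X_i, t_i)$ are replaced with $(Z_i, \delta_i, X_i, t_i)$. Here, the pair $(Z_i, \delta_i)$ can be expressed as follows:
$$Z_i = \min(Y_i, C_i), \quad \delta_i = I(Y_i \le C_i), \quad i = 1, \ldots, n, \tag{2}$$
where $Z_i$ is the updated response variable with respect to the censored data, with unknown distribution, and the $\delta_i$'s are the values of the censoring indicator function, which contains the censoring information. If the $i$th observation is censored, then $\delta_i = 0$; otherwise $\delta_i = 1$. Thus, the model is rewritten as follows:
$$Z_i = X_i^{\top}\beta + f(t_i) + \varepsilon_i, \quad i = 1, \ldots, n. \tag{3}$$
For details about the estimation of model (3), see Green and Silverman.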
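The censoring mechanism described above can be illustrated with a small simulation. The following is only a sketch, assuming the usual right-censoring construction in which the observed response is the smaller of the lifetime and the censoring variable, and the indicator is 1 for an uncensored point. The function name `simulate_right_censored`, the sine-shaped smooth function, the error scale and the normal censoring distribution are all hypothetical choices made for illustration; none of them come from the paper.

```python
import numpy as np

def simulate_right_censored(n=200, beta=(1.0, -0.5), censor_loc=3.0, seed=0):
    """Simulate a partially linear model and right-censor its response.

    All distributional choices below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(beta)))         # parametric covariates
    t = np.sort(rng.uniform(0.0, 1.0, size=n))  # nonparametric covariate
    f = np.sin(2.0 * np.pi * t)                 # hypothetical smooth function f
    eps = rng.normal(scale=0.3, size=n)         # random errors
    Y = X @ np.asarray(beta) + f + eps          # uncensored response
    C = rng.normal(loc=censor_loc, size=n)      # hypothetical censoring variable
    Z = np.minimum(Y, C)                        # observed (possibly censored) response
    delta = (Y <= C).astype(int)                # 1 = uncensored, 0 = censored
    return X, t, Y, Z, delta

X, t, Y, Z, delta = simulate_right_censored()
print("censoring rate:", 1.0 - delta.mean())
```

Lowering `censor_loc` increases the censoring rate, which is a convenient way to see how each censorship solution behaves as more responses become incomplete.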

Censorship solutions
To account for censorship in the estimation process, three common methods are used: Kaplan-Meier weights, synthetic data transformation and the kNN imputation method. These methods turn the data into an appropriate form for the modelling procedure.

Kaplan-Meier weights:
To overcome right-censored response observations, Kaplan-Meier (K-M) weights are used, which are discussed in detail by Stute. K-M weights can be calculated based on the Kaplan-Meier estimator $\hat{F}$ of the distribution function $F$ of the lifetimes $Y_i$ at each value $Z_{(i)}$, given by:
$$W_{(i)} = \frac{\delta_{(i)}}{n-i+1} \prod_{j=1}^{i-1} \left( \frac{n-j}{n-j+1} \right)^{\delta_{(j)}}, \quad i = 1, \ldots, n,$$
where $\delta_{(i)}$ denotes the value of the censoring indicator associated with the ordered values $Z_{(1)} \le \cdots \le Z_{(n)}$.
Synthetic data transformation:
Synthetic data transformation is a common method in the literature for handling censored data, and various authors have proposed different methods, such as Leurgans, Buckley and James, and Koul et al. [5][6][7]. In this paper, synthetic response values are obtained based on the method proposed by Koul et al. [6]. In this sense, the data transformation is realized by:
$$Y_i^{*} = \frac{\delta_i Z_i}{1 - G(Z_i)}, \quad i = 1, \ldots, n,$$
where $G$ is the distribution function of the censoring variable, as mentioned before.
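Both transformations can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation. It assumes the product form of the K-M weights as the jumps of the Kaplan-Meier estimator at the ordered responses, and a synthetic response of the Koul et al. type in which the unknown censoring distribution is replaced by its own Kaplan-Meier estimate. The function names are hypothetical, and the data are assumed to have no ties between censored and uncensored values.

```python
import numpy as np

def _order(Z, delta):
    # Sort by Z; if values tie, uncensored points are placed first.
    return np.lexsort((1 - delta, Z))

def km_weights(Z, delta):
    """Kaplan-Meier weights, returned in the original data order."""
    Z = np.asarray(Z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(Z)
    order = _order(Z, delta)
    d = delta[order]
    i = np.arange(1, n + 1)
    factors = ((n - i) / (n - i + 1.0)) ** d
    # Product over j < i (the empty product for i = 1 equals 1).
    prod_before = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w = np.empty(n)
    w[order] = d / (n - i + 1.0) * prod_before
    return w

def censoring_survival(Z, delta):
    """K-M estimate of 1 - G at each Z_i, treating censoring as the event."""
    Z = np.asarray(Z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(Z)
    order = _order(Z, delta)
    c = 1 - delta[order]
    i = np.arange(1, n + 1)
    surv = np.empty(n)
    surv[order] = np.cumprod(((n - i) / (n - i + 1.0)) ** c)
    return surv

def synthetic_response(Z, delta):
    """Synthetic responses: delta * Z / (1 - G_hat(Z)), zero if censored."""
    Z = np.asarray(Z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    surv = censoring_survival(Z, delta)
    y_star = np.zeros(len(Z))
    unc = delta == 1
    y_star[unc] = Z[unc] / surv[unc]  # censored points keep the value 0
    return y_star

Z = [1.0, 2.0, 3.0, 4.0]
delta = [1, 0, 1, 1]
print(km_weights(Z, delta))
print(synthetic_response(Z, delta))
```

In both cases the censored observations receive zero weight or a zero synthetic value, while the uncensored observations that come after a censored point are inflated to compensate.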
kNN imputation method:
Imputation is a class of methods that fills in the censored observations with estimated ones. It relies on the relationships present in the dataset that can be helpful in estimating censored observations. In this paper, kNN imputation is used. Using imputation to handle censored data differs from the two methods mentioned above. These differences, together with the important properties of kNN imputation, can be listed as follows:
1. The method is distribution-free. This feature provides an important advantage when dealing with data that does not fit any distribution family.
2. Right-censored data points are completed with actual observations, not synthetic or constructed values.
3. Unlike synthetic data transformation and K-M weights, the kNN method utilizes the explanatory variables to supply additional information in completing censored data points.
4. One of the most important properties of kNN imputation is that it is a fully nonparametric method and does not include any assumptions about the relationship between the observation pairs $(X_i, Y_i)$ or $(X_i, Z_i)$, $i = 1, \ldots, n$.
5. The method can work with discrete and continuous variables. For discrete attributes, it uses the most frequent value among the k nearest neighbours. For continuous attributes, it uses the mean value of the k nearest neighbours.
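The properties listed above can be illustrated with a short sketch of kNN imputation for a continuous censored response. The paper does not spell out its exact algorithm, so the details here are assumptions: donors are restricted to uncensored observations, distances are Euclidean on standardised explanatory variables, and each censored response is replaced by the mean response of its k nearest donors. The function name `knn_impute_censored` is hypothetical.

```python
import numpy as np

def knn_impute_censored(X, Z, delta, k=3):
    """Replace censored responses by the mean of the k nearest uncensored ones."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D input as a single covariate
    Z = np.asarray(Z, dtype=float)
    delta = np.asarray(delta, dtype=int)

    donors = np.flatnonzero(delta == 1)  # uncensored observations
    if len(donors) == 0:
        raise ValueError("at least one uncensored observation is required")
    k_eff = min(k, len(donors))

    # Standardise covariates so that no single column dominates the distance.
    scale = X.std(axis=0)
    scale[scale == 0] = 1.0
    Xs = (X - X.mean(axis=0)) / scale

    Y_imp = Z.copy()
    for i in np.flatnonzero(delta == 0):  # loop over censored points
        dist = np.linalg.norm(Xs[donors] - Xs[i], axis=1)
        nearest = donors[np.argsort(dist)[:k_eff]]
        Y_imp[i] = Z[nearest].mean()      # mean of the k nearest donors
    return Y_imp

X = [0.0, 1.0, 2.0, 10.0, 11.0]
Z = [1.0, 2.0, 3.0, 10.0, 5.0]
delta = [1, 1, 1, 1, 0]  # the last response is censored
print(knn_impute_censored(X, Z, delta, k=1))
```

For a discrete response, the mean in the loop would be replaced by the most frequent value among the neighbours. Restricting donors to observations whose response exceeds the censored value is another possible design choice, since the true lifetime of a right-censored point is at least its observed value.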

Discussion and Conclusion
Researchers deal with datasets from different populations, which have different distributions. Each of the censorship solution methods discussed has its own advantages, depending on the properties of the data. If the dataset is suitable for the Kaplan-Meier estimator, Kaplan-Meier weights and synthetic data transformation can be useful. On the other hand, the kNN method works entirely free from distributional assumptions, which is its important advantage. As a result, the choice of method is up to the researcher.