Prognostic factors of first-ever stroke patients in suburban Malaysia by comparing regression models



INTRODUCTION
World health statistics report that the leading causes of mortality worldwide are ischemic heart disease and lower respiratory infections, followed by stroke. Years of life lost are calculated from the number of deaths at each age multiplied by the life expectancy at the age at which death occurred. Over the last 12 years, the global proportion of years of life lost due to non-communicable diseases has increased from 38% to 47%. Although recent efforts have succeeded in reducing mortality from communicable diseases, non-communicable diseases, stroke among them, have become the primary cause of death [1].
Multivariable analysis is important for statistically adjusting the estimated effect of each variable in a model and for more comprehensive statistical modelling. The Cox proportional hazards regression model has been widely used to determine the prognostic factors of mortality in stroke patients [2][3][4][5][6]. An alternative to the Cox model reported in the literature is the binary logistic regression model [7][8][9][10]. However, no information has been reported comparing different statistical models for determining the prognostic factors of mortality in first-ever stroke patients.
Although a few studies have compared multivariable statistical models for a given outcome, very few have done so in clinical settings. In addition, there is major concern about collecting and analysing quality data on serious clinical conditions such as stroke in order to determine clinically important and plausible prognostic factors. Questions always arise as to which kind of outcome data should be collected and which multivariable analysis should be applied. There is a need to show that, if reliable data with meaningful variables can be collected, the choice of analysis should be of less concern, since the results are expected to be very similar for a common outcome such as mortality.
Models fitted using different statistical analyses are compared in terms of five major aspects of the parameter estimates, namely direction, estimation, precision, significance, and magnitude. With good data quality and data management, similar findings on these five points are hypothesized. Data quality starts with well-defined categorical and numerical variables to be included in the study. Data quality also involves determining the minimum required sample size and using appropriate sampling methods. The validity and reliability of measurement tools also influence the quality of data. Data management involves data entry, data coding, selection of appropriate univariable and multivariable statistical tests, and clinically, biologically, and statistically plausible data analyses.
This study was intended to show researchers that they need not over-worry about the choice of outcome analysis, and to demonstrate that, even with different statistical methods, the direction, estimation, precision, significance, and magnitude of the parameter estimates are similar, provided the data are of good and reliable quality. Researchers can choose the statistical test based on the available data to answer the research questions. Another justification for this study was to highlight which analysis provides the most reliable, informative, and best-estimated results. Our study was conducted to compare the parameter estimates of prognostic factors of mortality in first-ever stroke patients using three different multivariable regression methods: Cox proportional hazards regression, multinomial logistic regression, and multiple logistic regression.

METHODS
A retrospective study was conducted among 432 first-ever stroke patients receiving care at Hospital Universiti Sains Malaysia, a 700-bed hospital serving a predominantly rural area in Northeast Malaysia. Data from medical records were reviewed, and related information was extracted using a standardized data collection sheet. Participants were included if they were aged more than 18 years and clinically diagnosed with first-ever stroke, as confirmed by computed tomography scan or magnetic resonance imaging and neurological examination during admission. Participants with recurrent stroke or with neurological deficits secondary to infection, epilepsy, tumor, or trauma were excluded from the study.
Power and Sample Size Calculation software version 3.1 [11][12] was used to determine the minimum required sample size. The sample size for Cox regression was calculated based on the variable type of stroke. The parameters required for the calculation were as follows: a level of significance (α) of 0.05; a pre-determined power (1-β) of 0.80; a detectable hazard ratio (HR) of 2.2 for patients with subarachnoid haemorrhage relative to those with cerebral infarct, decided by the researcher based on clinical expert opinion; the median survival time of stroke patients with cerebral infarct, obtained from the literature [13] (m1=84); the ratio of stroke patients with cerebral infarct to those with subarachnoid haemorrhage, obtained from the literature [13] (m=1,318/238=5.54); an accrual time (A) of 84 months, during which the patients were followed up; and an additional follow-up time (F) of 12 months. The predetermined sample size was 453 patients after adding 10% to the final figure in anticipation of non-readable records and missing data. Systematic random sampling was applied to the list of stroke patients within the study period.
A standardized data collection proforma was designed and verified by another researcher to record all related information from patients' medical records, namely demographic characteristics, past medical history, clinical characteristics, medications prior to stroke, and symptoms and signs of first-ever stroke. The dependent variables in this study were determined according to the chosen statistical analysis. For Cox regression, the dependent variable was time to event, which was the survival time of first-ever stroke patients, measured in days. The survival time was defined as the time interval between the time of diagnosis of stroke and the time of death due to stroke, with death as the event of interest. The censored observations were patients who did not experience the event of interest, who were still alive at the end of the study period, or who were lost to follow-up during the study period. For multinomial logistic regression, the dependent variable was the status of the patients, which was divided into three levels: alive without neurological deficit, alive with neurological deficit, and dead. For multiple logistic regression, the dependent variable was dichotomous: alive or dead.
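As a rough cross-check of the inputs listed above, the number of deaths needed to detect a given hazard ratio can be approximated with Schoenfeld's formula. This is a sketch under stated assumptions only: it is not the algorithm implemented in the sample size software used in this study, it ignores accrual time, follow-up time, and median survival, and it returns a number of events rather than a number of patients. The function name `schoenfeld_events` is hypothetical.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hr, alloc_ratio, alpha=0.05, power=0.80):
    """Approximate number of events for a two-sided test of a hazard ratio.

    alloc_ratio is the ratio of reference-group patients to
    comparison-group patients (here cerebral infarct to
    subarachnoid haemorrhage).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_ref = alloc_ratio / (1 + alloc_ratio)  # proportion in reference group
    p_cmp = 1 - p_ref                        # proportion in comparison group
    return (z_alpha + z_beta) ** 2 / (p_ref * p_cmp * math.log(hr) ** 2)


# Inputs taken from the text: HR = 2.2 and m = 1,318/238.
m = 1318 / 238
events = schoenfeld_events(hr=2.2, alloc_ratio=m)
print(f"allocation ratio m = {m:.2f}")
print(f"approximate events required = {math.ceil(events)}")
```

Because the approximation counts deaths only, it should be expected to be much smaller than the total number of patients that must be recruited.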
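The three dependent variables described above can all be derived from the same patient record. The following is a minimal sketch, assuming a hypothetical `Patient` record with invented field names and example dates; it is not the study's actual data structure, and the handling of patients lost to follow-up in the two logistic models is not described in the text, so it is not modelled here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Patient:
    # All field names are hypothetical, for illustration only.
    diagnosis_date: date
    last_contact_date: date  # date of death, or last date known alive
    died_of_stroke: bool
    neurological_deficit: Optional[bool] = None  # meaningful only if alive


def cox_outcome(p):
    """Return (survival time in days, event indicator); 0 means censored."""
    days = (p.last_contact_date - p.diagnosis_date).days
    return days, int(p.died_of_stroke)


def multinomial_outcome(p):
    """Three-level status used for multinomial logistic regression."""
    if p.died_of_stroke:
        return "dead"
    if p.neurological_deficit:
        return "alive with neurological deficit"
    return "alive without neurological deficit"


def binary_outcome(p):
    """Dichotomous status used for multiple logistic regression."""
    return "dead" if p.died_of_stroke else "alive"


# Two invented example patients.
deceased = Patient(date(2010, 1, 1), date(2010, 1, 31), died_of_stroke=True)
survivor = Patient(date(2010, 3, 1), date(2010, 12, 31),
                   died_of_stroke=False, neurological_deficit=True)

for p in (deceased, survivor):
    print(cox_outcome(p), multinomial_outcome(p), binary_outcome(p))
```

The sketch makes visible that the Cox outcome keeps the follow-up time, whereas the two logistic outcomes retain only the final status.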

Statistical Analysis
The data were entered and analysed using Stata/SE version 11 software and IBM SPSS Statistics version 22 software. The statistical tests used in this study were Cox proportional hazards regression, multinomial logistic regression, and multiple logistic regression. The comparison of the three statistical models was based on five major aspects of the parameter estimates, namely direction, estimation, precision, significance, and magnitude, given the different measures of the dependent outcome variable and the type of regression method.
The first point, direction, concerned the sign of the regression coefficient in relation to the risk estimate. A positive regression coefficient indicated a positive relationship with mortality, and a negative regression coefficient indicated a protective effect against mortality. The second point was estimation with a 95% confidence interval (CI), inferring from sample statistics to population parameters. The third point was the width of the CI, known as precision. The fourth point was the significance of the statistical analysis based on hypothesis formulation and testing, judged by the p-value. The last point was the magnitude of the risk, namely the HR, relative risk ratio (RRR), or odds ratio (OR), and whether these risk estimates were away from the null hypothesis with higher magnitude or with lesser magnitude. For some variables, all parameters gave the same finding, namely that the null hypothesis could not be rejected.
Cox proportional hazards regression is a regression model in which the outcome of interest is the time to an event, also known as survival time. After univariable analysis, variables were chosen for the multivariable analysis based on pre-determined criteria. The preliminary main effect model was obtained upon completion of variable selection. This was followed by checking the linearity of continuous variables, and then checking two-way interactions and multicollinearity, to obtain the preliminary final model. The preliminary final model was then checked for specification error. The next step was checking the proportional hazards assumption by using a hazard function plot, a log-minus-log plot, a scatterplot of scaled Schoenfeld residuals, the scaled and unscaled Schoenfeld residuals tests ($r_k^{*}(\beta) = V^{-1}(\beta, t_k)\, r_k(\beta)$), and the C-statistic. Regression diagnostics were then performed, including Cox-Snell residuals, martingale residuals, deviance residuals, and influence analysis [14]. Any potential influential outliers were detected from the influence analysis. The final model was achieved after remedial measures were performed by calculating the per cent change in the regression coefficient. If the change was equal to or more than 20%, the outlier was considered influential. The results were expressed in terms of the determined variables, adjusted regression coefficient (b), adjusted HR with 95% CI, Wald statistic, and corresponding p-values.
Multinomial logistic regression estimates the relationship between a polytomous dependent variable and more than one independent variable or covariate. Univariable multinomial logistic regression was performed for each logit function to screen for important prognostic factors. From the univariable analysis, variables with p-values less than 0.25 and of clinical importance were selected for the multivariable analysis, which adjusted for confounders. This gave the preliminary main effect model. Then, the linearity of the continuous variables was checked for the separate binary models, that is, for each logit function. The next step was checking interactions and multicollinearity between independent variables to obtain the preliminary final model. Then, the overall fit of the model for the separate logit functions was assessed using four methods, namely the Hosmer-Lemeshow test, the Pearson chi-square test, the classification table, and the area under the receiver operating characteristic (ROC) curve. Several plots have been suggested for regression diagnostics to identify influential covariate patterns for each logit function. The plots consisted of leverage (h), delta chi-square (Δχ²), delta deviance (ΔD), and Pregibon delta beta (Δβ) versus the estimated logistic probability [15]. The covariate patterns identified as influential were then further assessed for the change they produced in the regression coefficient. Changes equal to or more than 20% indicated that the covariate pattern was influential to the model. The final model was expressed in terms of the determined variables, adjusted regression coefficient (b), adjusted RRR with 95% CI, Wald statistic, and corresponding p-values.
Multiple logistic regression estimates the relationship between a dichotomous dependent variable and more than one independent variable or covariate, either numerical or categorical. Simple logistic regression was performed to screen for important prognostic factors. Any variable with a p-value less than 0.25 and with biological plausibility was selected for the multivariable model. The preliminary main effect model was achieved after variable selection. Then, the linearity of continuous variables was checked, followed by interactions and multicollinearity. The resulting model was termed the preliminary final model. The preliminary final model was then checked for specification error. Then, the overall fit of the model was assessed using four methods, namely the Hosmer-Lemeshow test, the Pearson chi-square test, the classification table, and the area under the ROC curve. Several plots have been suggested for regression diagnostics to identify influential covariate patterns. The plots consisted of leverage (h), delta chi-square (Δχ²), delta deviance (ΔD), and Pregibon delta beta (Δβ) versus the estimated logistic probability. The covariate patterns identified as influential were then further assessed for the change they produced in the regression coefficient. Changes equal to or more than 20% indicated that the covariate pattern was influential to the model. The final model was expressed in terms of the determined variables, adjusted regression coefficient (b), adjusted OR with 95% CI, Wald statistic, and corresponding p-values.
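The five points of comparison can be read off a single regression coefficient and its standard error. The sketch below is illustrative only: the function name `summarize_estimate`, the input values, and the use of the absolute coefficient as a numerical stand-in for magnitude are assumptions, not part of the study's analysis, which was carried out in Stata and SPSS.

```python
import math
from statistics import NormalDist


def summarize_estimate(b, se, alpha=0.05):
    """Summarize a coefficient b (log HR, log RRR, or log OR) with its SE."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ratio = math.exp(b)                    # HR, RRR, or OR
    lower = math.exp(b - z_crit * se)      # lower confidence limit
    upper = math.exp(b + z_crit * se)      # upper confidence limit
    wald_z = b / se
    p_value = 2 * (1 - NormalDist().cdf(abs(wald_z)))
    if b > 0:
        direction = "positively related to mortality"
    elif b < 0:
        direction = "protective"
    else:
        direction = "null"
    return {
        "direction": direction,
        "estimate": ratio,
        "ci": (lower, upper),
        "precision": upper - lower,        # CI width; narrower is more precise
        "p_value": p_value,
        "significant": p_value < alpha,
        "magnitude": abs(b),               # assumed: distance from null, log scale
    }


# Invented coefficient and standard error, not taken from Table 1.
summary = summarize_estimate(b=-0.5, se=0.2)
for key, value in summary.items():
    print(key, value)
```

Applying the same function to the coefficient of one variable from each of the three models gives a like-for-like view of direction, estimation, precision, significance, and magnitude.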
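To make the term "logit function" concrete, the sketch below shows how the two logit functions of a three-level outcome translate into predicted probabilities, taking "alive without neurological deficit" as the reference category. The linear predictor values are invented and the function name is hypothetical.

```python
import math


def multinomial_probabilities(logits):
    """Convert linear predictors of the non-reference outcomes to probabilities.

    logits[j] is the log of P(outcome j) / P(reference outcome).
    The returned list puts the reference category first.
    """
    denom = 1.0 + sum(math.exp(g) for g in logits)
    return [1.0 / denom] + [math.exp(g) / denom for g in logits]


# Invented linear predictors for one patient:
# g1: alive with deficit vs. reference; g2: dead vs. reference.
g1, g2 = 0.4, -0.7
p_ref, p_deficit, p_dead = multinomial_probabilities([g1, g2])
print(p_ref, p_deficit, p_dead)

# The ratio of each outcome probability to the reference probability
# recovers exp(g), which is why exp(b) is read as a relative risk ratio.
print(p_dead / p_ref, math.exp(g2))
```

A one-unit increase in a covariate with coefficient b in the second logit function multiplies the ratio P(dead)/P(reference) by exp(b), which is the RRR reported for that covariate.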
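Of the four fit measures listed above, the area under the ROC curve is the simplest to illustrate. The sketch below computes it as the proportion of (dead, alive) patient pairs in which the patient who died received the higher predicted probability, counting ties as one half. The data are invented, and the function `roc_auc` is a plain-Python illustration rather than the routine used in Stata or SPSS.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true: 1 for the event (death), 0 otherwise.
    y_score: predicted probability of the event.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented outcomes and predicted probabilities for five patients.
status = [1, 1, 0, 0, 0]
predicted = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(status, predicted)
print(round(auc, 3))
```

A value of 0.5 corresponds to discrimination no better than chance and a value of 1.0 to perfect separation of patients who died from those who survived.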

RESULTS
The comparison of statistical modelling using three different statistical analyses is shown in Table 1. There were 12 significant prognostic factors detected when using Cox regression, 11 factors when using multinomial logistic regression, and nine factors when using multiple logistic regression. Male gender was identified as a significant variable in all three regression models, where it was a protective factor towards mortality.
The second variable identified as a significant prognostic factor towards mortality was fasting blood sugar. The third significant prognostic factor towards mortality was marital status with four levels: never married (as a reference), married, widowed, and divorced. Being married was a protective factor towards mortality in all three models. Being widowed, on the other hand, was a protective factor in Cox regression and multiple logistic regression. Being divorced was not a significant determinant in any of the three models. Other significant prognostic factors towards mortality were diastolic blood pressure and systolic blood pressure; however, these were identified only in Cox regression. The next prognostic factor towards mortality was urea. Another factor was rheumatic heart disease, which was identified in two models, Cox regression and multiple logistic regression. Another significant factor in all models was smoking status with three levels: never smoked (as a reference), ever-smoker, and current smoker. Having a seizure/fit was a protective determinant towards mortality and was observed in Cox regression and multiple logistic regression. Glasgow coma scale was also a significant prognostic factor towards mortality in all three models. Other prognostic factors were the usage of aspirin and age at the time of diagnosis; however, these variables were identified only in the Cox regression model. Final diagnosis with three levels (ischaemic stroke [as a reference], intracerebral haemorrhage, and subarachnoid haemorrhage) was identified as a prognostic factor in the multinomial logistic regression and multiple logistic regression models but not in the Cox regression model. However, only subarachnoid haemorrhage was observed as a significant prognostic factor towards mortality. Atrial fibrillation and paresis at any site were identified as further determinants of mortality, but only in multinomial logistic regression and not in the other two models.

DISCUSSION
Worldwide, stroke comes second after ischaemic heart disease as the leading cause of death over the past decade [16].
In Malaysia, stroke is one of the ten principal causes of hospitalization and death in hospitals [17]. The current study was a retrospective study that aimed to determine prognostic factors of mortality among first-ever stroke patients receiving care at a 700-bed hospital serving a predominantly rural area in Northeast Malaysia by using three different regression models. These three models were then compared based on the direction, estimation, precision, significance, and magnitude of the parameter estimates. Despite the use of three different statistical analysis approaches, namely Cox regression, multinomial logistic regression, and multiple logistic regression, the results were similar in terms of five major aspects based on inferential statistics. The first was the direction of risk. The second was estimation with a 95% CI, inferring from sample statistics to the population parameter. The third was the width of the CI, known as precision. The fourth was the significance of the statistical analysis based on hypothesis formulation and testing, judged by the p-value and the level of significance. The last was the magnitude of the risk, whether or not it was away from the null hypothesis.
The summary of the comparison of the three statistical models is shown in Table 2. In the current study, male gender was a protective factor towards mortality in all three regression models. The regression coefficient in all three models was negatively related to mortality, with the estimated HR, RRR, and OR all less than one. The 95% CI was narrower in Cox regression than in the other two regression models, indicating that the estimate was more precise in Cox regression. The p-value was also smaller in Cox regression, showing a highly significant result. All three measures of risk were away from the null hypothesis with lesser magnitude.
Since the OR is an estimate of the true risk, its value was slightly larger than the other two measures of risk. Overall, for the variable gender, the results did not differ much among the three statistical analyses, although they were slightly better in Cox regression.
For fasting blood sugar, on the other hand, the regression coefficient was positively related to mortality, with the estimated HR, RRR, and OR all more than one. The width of the 95% CI did not differ much between the three statistical analyses, although it was slightly more precise in Cox regression. The p-value was also highly significant in the model using Cox regression. The risk estimates were away from the null hypothesis with a higher magnitude, and the magnitude of the hazard ratio was slightly lower in Cox regression.
Being married was a protective factor towards mortality. The regression coefficient for married was negatively related to mortality, with the risk estimates less than one. The 95% CI was narrower in the model using Cox regression, showing that precision was better in this model. Although the p-value was significant in all three analyses, it was highly significant in Cox regression. The risk estimates were away from the null hypothesis with a lesser magnitude. The magnitude of the hazard ratio was lower than that of the RRR and OR for the married variable.
The regression coefficient for the widowed variable was negatively related to mortality, with the risk estimates less than one. The 95% CI was narrower in the model using Cox regression, showing that precision was better in this model. A significant p-value was observed only in the models using Cox regression and multiple logistic regression, not in multinomial logistic regression. The risk estimates were away from the null hypothesis with lesser magnitude in the two models. The magnitude of the hazard ratio was slightly smaller than that of the OR. However, the risk estimate in multinomial logistic regression was not able to reject the null hypothesis. Thus, the magnitude of the RRR was not interpretable.
For the divorced variable, the regression coefficient was negatively related to mortality, with the risk estimates less than one, in the models using Cox regression and multinomial logistic regression. In contrast, in the model using multiple logistic regression, the regression coefficient was positively related to mortality, with an OR of more than one. The 95% CI was narrower in Cox regression. In terms of the p-value, all models gave non-significant results. The risk estimates were not able to reject the null hypothesis. Therefore, the magnitude of the risk estimates could not be interpreted.
For diastolic blood pressure, which was identified only in Cox regression, the regression coefficient was positively related to mortality, with a hazard ratio of more than one. The 95% CI was relatively narrow. The p-value showed that the variable was highly significant. The hazard ratio was away from the null hypothesis with a higher magnitude; thus, the magnitude of the hazard ratio could be interpreted.
Next was the level of urea; its regression coefficient was positively related to mortality, with the risk estimates more than one. The 95% CI was relatively narrow in Cox regression compared to the other two analyses. A significant p-value was observed in all three models; however, it was highly significant in the multiple logistic regression model. The three risk estimates were away from the null hypothesis with a higher magnitude. The magnitude of the hazard ratio was slightly smaller than that of the other two risk estimates.
Another prognostic factor was systolic blood pressure; however, it was identified only in Cox regression. The regression coefficient was negatively related to mortality, with a hazard ratio of less than one. The 95% CI was relatively narrow. The p-value showed that the variable was highly significant. The hazard ratio was away from the null hypothesis with a lesser magnitude; thus, the magnitude of the hazard ratio could be interpreted.
The next factor was rheumatic heart disease, which was identified in two models, Cox regression and multiple logistic regression. The regression coefficient was positively related to mortality, with the risk estimates more than one. The 95% CI was relatively wide in both models, especially in the multiple logistic regression model. The p-value was more highly significant in survival analysis than in multiple logistic regression. Both risk estimates were away from the null hypothesis with a higher magnitude, and the magnitude of the hazard ratio was much lower than that of the OR.
Another significant factor was smoking status. The regression coefficient for ever-smoker was positively related to mortality, with the risk estimates more than one. The 95% CI was relatively wide in the multinomial logistic regression model compared to the other two models. In terms of the p-value, Cox regression gave the smallest value. The risk estimates of the three models were away from the null hypothesis with a higher magnitude. The magnitude of risk did not differ much between the models using Cox regression and multiple logistic regression.
For current smokers, the regression coefficient was also positively related to mortality, with the risk estimates more than one. The 95% CI was wider in the multinomial logistic regression model and narrower in the Cox regression model, showing that precision was better in the latter. The p-value was highly significant in Cox regression. The risk estimates were away from the null hypothesis with a higher magnitude in all three models. The magnitude of the hazard ratio was the lowest, followed by the OR and then the RRR.
Another prognostic factor towards mortality was seizure/fit, which was observed in Cox regression and multiple logistic regression. The regression coefficient was negatively related to mortality, with the risk estimates less than one. The width of the 95% CI was the same in these two models. Cox regression gave a smaller p-value than multiple logistic regression. The risk estimates in both models were away from the null hypothesis with lesser magnitude. The magnitude of the risk estimates did not differ much between the two models.
Glasgow coma scale was also a significant prognostic factor towards mortality in all three models. The regression coefficient was negatively related to mortality, with the risk estimates less than one. The 95% CI did not differ much between the models and was relatively narrow. The p-value was highly significant in all models. The risk estimates were away from the null hypothesis with lesser magnitude in all models. The magnitude of the risk estimates was similar in all models.
Another prognostic factor was the usage of aspirin; however, it was identified only in the Cox regression model. The regression coefficient was negatively related to mortality, with a hazard ratio of less than one. The 95% CI was relatively narrow. The p-value showed that the variable was statistically significant. The hazard ratio was away from the null hypothesis with a lesser magnitude; thus, the magnitude of the hazard ratio could be interpreted.
Another factor in Cox regression was age at the time of diagnosis. The regression coefficient was positively related to mortality, with a hazard ratio of more than one. The 95% CI was relatively narrow. The p-value was highly significant, and the risk estimate was away from the null hypothesis with a higher magnitude. Thus, the magnitude of the hazard ratio could be interpreted.
Type of diagnosis was identified as a prognostic factor in the multinomial logistic regression and multiple logistic regression models. For intracerebral haemorrhage, the regression coefficient was positively related to mortality, with the risk estimates more than one. The 95% CI was relatively narrow in both models. In terms of the p-value, both models gave non-significant results. The risk estimates in both models were not able to reject the null hypothesis. Therefore, the magnitude of the risk estimates could not be interpreted.
The regression coefficient for subarachnoid haemorrhage was positively related to mortality, with the risk estimates more than one. The 95% CI was more precise in multiple logistic regression, as it was narrower than in multinomial logistic regression. Multiple logistic regression also gave a smaller p-value. Both risk estimates were away from the null hypothesis with a higher magnitude. The OR had a smaller magnitude than the RRR.
Another prognostic factor towards mortality was atrial fibrillation, but it was identified only in multinomial logistic regression. The regression coefficient was positively related to mortality, with an RRR of more than one. The 95% CI was relatively wide. The p-value was significant, and the risk estimate was away from the null hypothesis with a higher magnitude. Therefore, the magnitude of the risk estimate was interpretable.
Paresis at any site was identified as another prognostic factor in multinomial logistic regression but not in the other two models. The regression coefficient was positively related to mortality, with an RRR of more than one. The width of the 95% CI was relatively acceptable. The p-value was highly significant, and the RRR was away from the null hypothesis with a higher magnitude; therefore, it was interpretable.
Comparison of each variable in three different regression models yielded more or less similar results, even though it was slightly better in Cox regression, especially in terms of precision of the results.
A few studies have reported that the Cox proportional hazards model performed better than the comparison model. A retrospective study among elderly people in the United States aimed to identify factors associated with long-stay nursing homes by comparing multivariable logistic regression and the Cox proportional hazards model [18]. There were nine significant variables in logistic regression but only eight in the Cox model. The hazard ratios from the Cox model and the ORs from logistic regression were reported to be similar. However, the standard error was smaller in the Cox model, indicating that results from this model were more precise than those from the logistic regression model. That study reported the comparison based only on precision, which was measured using the standard error instead of the width of the CI as in the current study.
The width of the CI depends partly on the standard error, which reflects both the standard deviation and the sample size [19]. Another prospective study among patients admitted to the intensive care unit compared three logistic regression models and reported that models using dichotomous scales were more appropriate than those using ordinal or quantitative scales, as the latter failed to fulfil the model assumptions and estimated biased coefficients.
The extended Cox regression model was the most valid model, as it fulfilled all the assumptions, estimated unbiased coefficients, and obtained a precise assessment of risk. However, that study did not report a comparison based on the five aspects used in the current study [20].
A prospective study among neurosurgical patients in a university teaching hospital in France used logistic and Cox models to assess the risk factors for surgical site infections and found only one significant risk factor. The parameter estimates, including the risk estimates (HR for the Cox model and OR for the logistic model) and their 95% CIs, were similar. That study reported only three aspects of the parameter estimates, namely estimation, precision, and magnitude of the risk [21].
A study from a project to prevent falls in veterans compared the performance of eight regression models for analysing the risk of falling, focusing on the effect of physical inactivity in older veterans. Three of the eight regression models treated falling as a non-recurrent event, and the parameter estimates were the OR from logistic regression, the risk ratio from modified Poisson regression, and the HR from Cox proportional hazards regression.
The magnitude of the point estimates was reported to differ among the three models: the risk ratio was the smallest, followed by the hazard ratio, and the OR gave the largest value. The CIs showed a similar pattern: the CI for the risk ratio was the most precise compared to those of the other two point estimates [22][23][24][25]. That study gave results similar to those of the current study for Cox proportional hazards regression and logistic regression in terms of estimation, precision, and magnitude of the risk estimates.
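The ordering of the point estimates noted above, with the OR lying farther from the null than the risk ratio, follows from how the two measures are defined when the outcome is common. The sketch below illustrates this with invented 2x2 counts; it does not use data from the current study or from the cited studies, and it does not cover the hazard ratio, which requires follow-up times.

```python
def risk_ratio(a, b, c, d):
    """a, b: exposed with and without the event; c, d: unexposed likewise."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio of the same 2x2 table."""
    return (a * d) / (b * c)


# Invented counts: a common outcome (40% vs. 20% risk).
common = (40, 60, 20, 80)
# Invented counts: a rare outcome (0.4% vs. 0.2% risk).
rare = (4, 996, 2, 998)

for label, table in (("common", common), ("rare", rare)):
    print(label, round(risk_ratio(*table), 3), round(odds_ratio(*table), 3))
```

With a common outcome such as death after stroke, the OR overstates the risk ratio, whereas the two nearly coincide when the outcome is rare.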

CONCLUSIONS
This study compared three types of statistical analyses frequently used in the literature to analyse stroke data. Our study has shown that, with reliable data on a common outcome of interest, different multivariable regression methods provide comparable findings on five aspects, namely direction, estimation, precision, significance, and magnitude, all of which appeared clinically and statistically plausible.

Table 1. Comparison of findings of prognostic factors of first-ever stroke patients using three different advanced statistical modelling approaches
a: Multinomial logistic regression for second logit function (death vs. alive without neurological deficit); b: Regression coefficient; CI: Confidence interval; HR: Hazard ratio; RRR: Relative risk ratio; OR: Odds ratio

Table 2. Summary of comparison of three different statistical modelling
