Survival Analysis with Random Forests: Experiment and Code

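Application of the Random Forest Model in Survival Analysis

Author: 答禄夫天

[Abstract]
Objective: This paper examines how the random forest method can serve as a variable selection tool when analyzing survival data that are high-dimensional, strongly correlated, and small in sample size.
Methods: Using the construction of a breast cancer metastasis risk assessment model from a breast cancer dataset as a worked example, variables are selected with a random forest model and a Cox regression model is then fitted.
Results: By selecting variables, the random forest model effectively handles high-dimensional, strongly correlated data and achieves a relatively high AUC.

I. Data Description

The breast cancer dataset comes from NCBI and contains 77 observations and 22,286 gene variables. After screening, 454 gene variables are retained. The data are randomly split into a training set and a test set, with 2/3 of the observations used for training and 1/3 for testing. The K-M curves are plotted:

[Figure: K-M curves]

II. Random Forest Model

A random forest consists of many decision trees. Because these trees are grown using randomization, they are also called random decision trees. The trees in a forest are independent of one another. When test data enter the forest, every decision tree classifies them, and the class chosen by the largest number of trees is taken as the final result. A random forest is therefore a classifier made up of multiple decision trees whose output class is the mode of the classes output by the individual trees.

The random forest model obtained with the randomForestSRC package has the following properties:

    Number of deaths: 27
    Number of trees: 800
    Minimum terminal node size: 3
    Average no. of terminal nodes: 14.4275
    No. of variables tried at each split: 3
    Total no. of variables: 452
    Analysis: RSF
    Family: surv
    Splitting rule: logrank
    Error rate: 19.87%

The model obtained by applying the random forest directly has a large prediction error of 19.87%. The random forest model is therefore used next for variable selection, with the following result:

    our.rf$rfsrc.refit.obj
    Sample size: 52
    Number of deaths: 19
    Number of trees: 500
    Minimum terminal node size: 2
    Average no. of terminal nodes: 11.554
    No. of variables tried at each split: 3
    Total no. of variables: 9
    Analysis: RSF
    Family: surv
    Splitting rule: logrank *random*
    Number of random split points: 10
    Error rate: 11.4%

    our.rf$topvars
    [1] 213821_s_at 219778_at   204690_at   220788_s_at 202202_s_at
    [6] 211603_s_at 213055_at   219336_s_at 37892_at

Nine variables are selected in total, and the error rate drops to 11.4%. Next, a Cox regression is fitted on these variables and the variables that are not significant in the model (0.01) are removed, leaving four variables that finally enter the model. The model results are as follows:

                 exp(coef)  exp(-coef)  lower .95  upper .95
    218150_at       1.6541      0.6046    0.11086    24.6800
    200914_x_at     0.9915      1.0086    0.34094     2.8833
    220788_s_at     0.2649      3.7750    0.05944     1.1805
    201398_s_at     1.7457      0.5729    0.33109     9.2038
    201719_s_at     2.4708      0.4047    0.93808     6.5081
    202945_at       0.4118      2.4284    0.03990     4.2499
    203261_at       3.1502      0.3174    0.33641    29.4983
    203757_s_at     0.7861      1.2720    0.61656     1.0024
    205068_s_at     0.1073      9.3180    0.02223     0.5181

Finally, six variables are selected to fit the survival model, and the survival curve is plotted:

[Figure: survival curve of the Cox model]

ROC curves are then plotted on the training set and on the test set:

[Figure: time-dependent ROC curves, training set]

[Figure: time-dependent ROC curves, test set]

Because the test set contains very few samples, the resulting AUC values fluctuate widely. The AUC on the training set is therefore computed repeatedly with the bootstrap and averaged to assess the model:

    AUC at 1 year:  0.8039456
    AUC at 3 years: 0.6956907
    AUC at 5 years: 0.7024846

These results show that the random forest completes the variable selection task by removing low-contribution variables and achieves a relatively high AUC on the test set, although the AUC is slightly lower than that obtained with the lasso-cox model.

Appendix:

    load("/R/brea.rda")
    library(survival)
    set.seed(10)

    # Random split: 52 of the 77 observations for training, the rest for testing
    i <- sample(1:77, 52)
    train <- dat[i, ]
    test <- dat[-i, ]

    # Random survival forest on all candidate variables
    library(randomForestSRC)
    disease.rf <- rfsrc(Surv(time, status) ~ ., data = train, ntree = 800,
                        mtry = 3, nodesize = 3, splitrule = "logrank")
    disease.rf

    # Variable selection by variable hunting with VIMP
    our.rf <- var.select(object = disease.rf, vdv, method = "vh.vimp", nrep = 50)
    our.rf$rfsrc.refit.obj
    our.rf$topvars

    # Column positions of the selected variables
    # (the original referred to an undefined object var.rf; our.rf is used here)
    index <- numeric(our.rf$modelsize)
    for (i in 1:our.rf$modelsize) {
      index[i] <- which(names(dat) == our.rf$topvars[i])
    }
    data <- dat[, c(1, 2, index)]   # columns 1 and 2 are time and status

    i <- sample(1:77, 52)
    train <- data[i, ]
    test <- data[-i, ]

    # Cox model on the selected variables; keep those with p-value <= 0.1
    mod.brea <- coxph(Surv(time, status) ~ ., data = train)
    keep <- which(summary(mod.brea)$coefficients[, 5] <= 0.1) + 2
    train_data <- train[, c(1, 2, keep)]
    test_data <- test[, c(1, 2, keep)]
    mod.brea1 <- coxph(Surv(time, status) ~ ., data = train_data)
    summary(mod.brea1)
    names(coef(mod.brea1))
    plot(survfit(mod.brea1), xlab = "Time", ylab = "Proportion",
         main = "Cox Model", conf.int = TRUE,
         col = c("black", "red", "red"), ylim = c(0.6, 1))

    # Match coefficient names back to the columns of dat.
    # The gsub() pattern was lost in extraction; stripping the backticks that
    # coxph puts around non-syntactic names is assumed here.
    index0 <- numeric(length(coef(mod.brea1)))
    coefficients <- coef(mod.brea1)
    name <- gsub("`", "", names(coefficients))
    for (j in 1:length(index0)) {
      index0[j] <- which(names(dat) == name[j])
    }

    # Time-dependent ROC on the training set
    library(survivalROC)
    riskscore <- as.matrix(dat[i, index0]) %*% as.matrix(coefficients)
    y1 <- survivalROC(Stime = train$time, status = train$status, marker = riskscore,
                      predict.time = 1, span = 0.25 * nrow(train)^(-0.20))
    y3 <- survivalROC(Stime = train$time, status = train$status, marker = riskscore,
                      predict.time = 3, span = 0.25 * nrow(train)^(-0.20))
    y5 <- survivalROC(Stime = train$time, status = train$status, marker = riskscore,
                      predict.time = 5, span = 0.25 * nrow(train)^(-0.20))
    a <- matrix(data = c("y1", "y3", "y5", y1$AUC, y3$AUC, y5$AUC), nrow = 3, ncol = 2); a
    plot(y1$FP, y1$TP, type = "l", xlab = "False Positive Rate",
         ylab = "True Positive Rate", main = "Time-dependent ROC curve", col = "green")
    lines(y3$FP, y3$TP, col = "red", lty = 2)
    lines(y5$FP, y5$TP, col = "blue", lty = 3)
    legend("bottomright", bty = "n",
           legend = c("AUC at 1 year:0.9271", "AUC at 3 years:0.8621", "AUC at 5 years:0.8263"),
           col = c("green", "red", "blue"), lty = c(1, 2, 3), cex = 0.9)
    abline(0, 1)

    # Time-dependent ROC on the test set
    riskscore <- as.matrix(dat[-i, index0]) %*% as.matrix(coefficients)
    y1 <- survivalROC(Stime = test$time, status = test$status, marker = riskscore,
                      predict.time = 1, span = 0.25 * nrow(train)^(-0.20))
    y3 <- survivalROC(Stime = test$time, status = test$status, marker = riskscore,
                      predict.time = 3, span = 0.25 * nrow(train)^(-0.20))
    y5 <- survivalROC(Stime = test$time, status = test$status, marker = riskscore,
                      predict.time = 5, span = 0.25 * nrow(train)^(-0.20))
    a <- matrix(data = c("y1", "y3", "y5", y1$AUC, y3$AUC, y5$AUC), nrow = 3, ncol = 2); a
    plot(y1$FP, y1$TP, type = "l", xlab = "False Positive Rate",
         ylab = "True Positive Rate", main = "Time-dependent ROC curve", col = "green")
    lines(y3$FP, y3$TP, col = "red", lty = 2)
    lines(y5$FP, y5$TP, col = "blue", lty = 3)
    legend("bottomright", bty = "n",
           legend = c("AUC at 1 year:0.8761", "AUC at 3 years:0.7611", "AUC at 5 years:0.7611"),
           col = c("green", "red", "blue"), lty = c(1, 2, 3), cex = 0.9)
    abline(0, 1)

    # Repeat the random split 30 times and record the AUC at 1, 3, and 5 years
    a <- matrix(0, 30, 3)
    for (k in 1:30) {
      i <- sample(1:77, 52)
      train <- data[i, ]
      test <- data[-i, ]
      mod.brea <- coxph(Surv(time, status) ~ ., data = train)
      keep <- which(summary(mod.brea)$coefficients[, 5] <= 0.1) + 2
      train_data <- train[, c(1, 2, keep)]
      test_data <- test[, c(1, 2, keep)]
      mod.brea1 <- coxph(Surv(time, status) ~ ., data = train_data)
      index0 <- numeric(length(coef(mod.brea1)))
      coefficients <- coef(mod.brea1)
      name <- gsub("`", "", names(coefficients))
      for (j in 1:length(index0)) {
        index0[j] <- which(names(dat) == name[j])
      }
      riskscore <- as.matrix(dat[-i, index0]) %*% as.matrix(coefficients)
      y1 <- survivalROC(Stime = test$time, status = test$status, marker = riskscore,
                        predict.time = 1, span = 0.25 * nrow(train)^(-0.20))
      y3 <- survivalROC(Stime = test$time, status = test$status, marker = riskscore,
                        predict.time = 3, span = 0.25 * nrow(train)^(-0.20))
      y5 <- survivalROC(Stime = test$time, status = test$status, marker = riskscore,
                        predict.time = 5, span = 0.25 * nrow(train)^(-0.20))
      a[k, ] <- c(y1$AUC, y3$AUC, y5$AUC)
    }
    colMeans(a)   # average AUC at 1, 3, and 5 years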
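The appendix builds its risk score by multiplying the covariate matrix by the fitted Cox coefficients, and the results table reports exp(coef), exp(-coef), and 95% bounds. The following is a minimal, self-contained sketch of those two steps on simulated data, since the original brea.rda file is not available here. The variable names g1 and g2, the true coefficients, and the censoring rate are all hypothetical stand-ins and not values from the study.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(1)
n <- 60

# Hypothetical simulated covariates standing in for two gene expression columns
X <- data.frame(g1 = rnorm(n), g2 = rnorm(n))
true_beta <- c(0.8, -0.5)                      # assumed values, illustration only

# Exponential event times whose hazard depends on the covariates,
# plus independent exponential censoring
event_time <- rexp(n, rate = as.vector(exp(as.matrix(X) %*% true_beta)))
cens_time  <- rexp(n, rate = 0.3)
d <- data.frame(time   = pmin(event_time, cens_time),
                status = as.integer(event_time <= cens_time),
                X)

fit <- coxph(Surv(time, status) ~ ., data = d)

# Hazard ratio table with the same columns as the table in the text:
# exp(coef), exp(-coef), lower .95, upper .95
hr_table <- summary(fit)$conf.int
print(hr_table)

# Manual risk score, computed the same way as in the appendix
riskscore <- as.matrix(d[, names(coef(fit))]) %*% as.matrix(coef(fit))

# predict() returns the centered linear predictor, so it should differ from
# the manual score only by a constant shift
lp    <- predict(fit, type = "lp")
shift <- riskscore[, 1] - lp
print(range(shift))
```

Because the two scores differ only by a constant, they rank subjects identically, so a rank-based measure such as a time-dependent ROC curve should not depend on which of the two is passed as the marker.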
