edgeRDESeq2分析RNAseq差异表达.docx

资源描述

edgeRDESeq2分析RNAseq差异表达.docx

《edgeRDESeq2分析RNAseq差异表达.docx》由会员分享，可在线阅读，更多相关《edgeRDESeq2分析RNAseq差异表达.docx（13页珍藏版）》请在冰豆网上搜索。

edgeRDESeq2分析RNAseq差异表达.docx

edgeRDESeq2分析RNAseq差异表达

edgeR包的安装

∙edgeR包是基于 Bioconductor 平台发布的，所以安装不能直接用 install.packages（）命令从CRAN上来下载

∙安装:

#tryhttp:

//ifhttps:

//URLsarenotsupported

>source（"https:

//bioconductor.org/biocLite.R"）

>biocLite（"edgeR"）

数据导入

∙由于edgeR对测序结果的下游分析是依赖count计数来进行基因差异表达分析的，在这里使用的是featureCounts 来进行统计｀.bam｀　文件中Map的结果

∙count结果如下:

>library（edgeR）

>mydata<-read.table（"counts.txt",header=TRUE,quote='\t',skip=1）

>sampleNames<-c（"CA_1","CA_2","CA_3","CC_1","CC_2","CC_3"）

>names（mydata）[7:

12]<-sampleNames

>head（mydata）

GeneidChrStartEndStrandLengthCA_1CA_2CA_3CC_1CC_2CC_3

1gene1314NW_139421.112571745+489000000

2gene1315NW_139421.121153452+1338000000

3gene1316NW_139421.138564680+825000000

4gene1317NW_139421.148665435-570000000

5gene1318NW_139421.160666836-771000000

6gene1319NW_139421.172949483+2190000000

∙在这里我们只是需要Geneid和后6列的样本的count信息来组成矩阵，所以要处理下

>countMatrix<-as.matrix（mydata[7:

12]）

>rownames（countMatrix）<-mydata$Geneid

>head（countMatrix）

CA_1CA_2CA_3CC_1CC_2CC_3

gene1314000000

gene1315000000

gene1316000000

gene1317000000

gene1318000000

gene1319000000

*要导入的矩阵由3v3样本组成（三组生物学重复）

创建DEGlist

>group<-factor（c（"CA","CA","CA","CC","CC","CC"））

>y<-DGEList（counts=countMatrix,group=group）

Anobjectofclass"DGEList"

$counts

CA_1CA_2CA_3CC_1CC_2CC_3

gene1314000000

gene1315000000

gene1316000000

gene1317000000

gene1318000000

14212morerows...

$samples

grouplib.sizenorm.factors

CA_1CA_117885371

CA_2CA_218255461

CA_3CA_319030171

CC_1CC_118260421

CC_2CC_221244681

CC_3CC_320250631

过滤

∙过滤掉那些count结果都为0的数据，这些没有表达的基因对结果的分析没有用，过滤又两点好处:

1可以减少内存的压力2可以减少计算的压力

>keep<-rowSums（cpm（y）>1）>=2

>y<-y[keep,,keep.lib.sizes=FALSE]

Anobjectofclass"DGEList"

$counts

CA_1CA_2CA_3CC_1CC_2CC_3

gene1321161138129218194220

gene1322231133

gene1323202733475146

gene132460877986100132

gene1325322921587556

3877morerows...

$samples

grouplib.sizenorm.factors

CA_1CA_117883621

CA_2CA_218253081

CA_3CA_319027961

CC_1CC_118258891

CC_2CC_221241551

CC_3CC_320247861

标准化处理

∙edgeR采用的是TMM方法进行标准化处理,只有标准化处理后的数据才又可比性

>y<-calcNormFactors（y）

Anobjectofclass"DGEList"

$counts

CA_1CA_2CA_3CC_1CC_2CC_3

gene1321161138129218194220

gene1322231133

gene1323202733475146

gene132460877986100132

gene1325322921587556

3877morerows...

$samples

grouplib.sizenorm.factors

CA_1CA_117883620.9553769

CA_2CA_218253080.9052539

CA_3CA_319027960.9686232

CC_1CC_118258890.9923455

CC_2CC_221241551.1275178

CC_3CC_320247861.0668754

设计矩阵

∙为什么要一个设计矩阵呢，道理很简单，有了一个设计矩阵才能够更好的分组分析

>subGroup<-factor（substring（colnames（countMatrix）,4,4））

>design<-model.matrix（~subGroup+group）

>rownames（design）<-colnames（y）

>design

（Intercept）subGroup2subGroup3groupCC

CA_11000

CA_21100

CA_31010

CC_11001

CC_21101

CC_31011

attr（,"assign"）

[1]0112

attr（,"contrasts"）

attr（,"contrasts"）$subGroup

[1]"contr.treatment"

attr（,"contrasts"）$group

[1]"contr.treatment"

评估离散度

>y<-estimateDisp（y,design,robust=TRUE）

>y$common.dispersion

[1]0.02683622

#plot

>plotBCV（y）

差异表达基因

>fit<-glmQLFit（y,design,robust=TRUE）

>qlf<-glmQLFTest（fit）

>topTags（qlf）

Coefficient:

groupCC

logFClogCPMFPValueFDR

gene7024-5.5156489.612809594.92326.431484e-442.496702e-40

gene66125.1302828.451143468.20601.557517e-393.023140e-36

gene27434.3774925.586773208.02683.488383e-264.513967e-23

gene120324.7343835.098148192.93784.359649e-254.231040e-22

gene491-2.73391010.412673190.98396.104188e-254.739291e-22

gene89412.9971856.839106177.76146.332836e-244.097345e-21

gene2611-2.8469247.216173174.73321.099339e-236.096619e-21

gene62422.5291259.897771169.26583.022914e-231.466869e-20

gene72523.7323156.137670188.20943.890569e-231.678132e-20

gene61252.8754236.569935160.31891.656083e-226.428914e-20

查看差异表达基因原始的CMP

>top<-rownames（topTags（qlf））

>cpm（y）[top,]

CA_1CA_2CA_3CC_1CC_2CC_3

gene70241711.3830021405.8618991480.12111533.1141837.1604029.62696

gene661217.55864912.10384826.585753403.99298582.457961044.35046

gene27434.6823061.8155775.96823062.9169487.26431114.34156

gene120321.7558652.4207702.71283265.6764647.5987275.45617

gene4912811.1397272059.4696692222.351938444.83381385.38258253.68087

gene894123.99682024.81288824.415488131.35291244.67410225.90560

gene2611245.821088310.463691225.16505243.0484326.3045539.81123

gene6242231.188880299.570228298.4115151348.298991343.619882191.93237

gene72529.36461313.3142325.42566492.71970108.55847181.92807

gene612523.41153214.52461729.841152145.70239160.75005185.16852

查看上调和下调基因的数目

>summary（dt<-decideTestsDGE（qlf））

[,1]

-1536

02793

1553

挑选出差异表达基因的名字

>isDE<-as.logical（dt）

>DEnames<-rownames（y）[isDE]

>head（DEnames）

[1]"gene1325""gene1326""gene1327""gene1331""gene1340""gene1343"

差异表达基因画图

>plotSmear（qlf,de.tags=DEnames）

>abline（h=c（-1,1）,col="blue"）

DESeq2包的安装

∙安装:

##tryhttp:

//ifhttps:

//URLsarenotsupported

>source（"https:

//bioconductor.org/biocLite.R"）

>biocLite（"DESeq2"）

数据导入

∙导入count矩阵，导入数据的方式很多这里直接导入count矩阵

∙count结果如下:

library（DESeq2）

sampleNames<-c（"CA_1","CA_2","CA_3","CC_1","CC_2","CC_3"）

mydata<-read.table（"counts.txt",header=TRUE,quote='\t',skip=1）

names（mydata）[7:

12]<-sampleNames

countMatrix<-as.matrix（mydata[7:

12]）

rownames（countMatrix）<-mydata$Geneid

table2<-data.frame（name=c（"CA_1","CA_2","CA_3","CC_1","CC_2","CC_3"）,condition=（"CA","CA","CA","CC","CC","CC"））

rownames（table2）<-sampleNames

head（countMatrix）

CA_1CA_2CA_3CC_1CC_2CC_3

gene1314000000

gene1315000000

gene1316000000

gene1317000000

gene1318000000

gene1319000000

∙把count矩阵转化为DESeq2　的数据格式

>dds<-DESeqDataSetFromMatrix（countMatrix,colData=table2,design=~condition）

>dds

class:

DESeqDataSet

dim:

142176

metadata（0）:

assays

（1）:

counts

rownames（14217）:

gene1314gene1315...gene6710gene6709

rowRangesmetadatacolumnnames（0）:

colnames（6）:

CA_1CA_2...CC_2CC_3

colDatanames

（2）:

namecondition

过滤

∙过滤掉那些count结果都为0的数据，这些没有表达的基因对结果的分析没有用

dds<-dds[rowSums（counts（dds））>1,]

dds

class:

DESeqDataSet

dim:

41906

metadata（0）:

assays

（1）:

counts

rownames（4190）:

gene1321gene1322...gene6712gene6710

rowRangesmetadatacolumnnames（0）:

colnames（6）:

CA_1CA_2...CC_2CC_3

colDatanames

（2）:

namecondition

PCA分析

rld<-rlog（dds）

plotPCA（rld,intgroup=c（"name","condition"））

∙当然也可以使用ggplot2　来画　PCA图

library（ggplot2）

rld<-rlog（dds）

data<-plotPCA（rld,intgroup=c（"condition","name"）,returnData=TRUE）

percentVar<-round（100*attr（data,"percentVar"））

p<-ggplot（data,aes（PC1,PC2,color=condition,shape=name））+

geom_point（size=3）+

xlab（paste0（"PC1:

",percentVar[1],"%variance"））+

ylab（paste0（"PC2:

",percentVar[2],"%variance"））

∙注意在进行PCA分析前不要 library（DESeq）否则无法进行PCA分析

差异表达基因分析

分析结果输出

library（DESeq）

dds<-DESeq（dds）

res<-results（dds）

write.table（res,"result.csv",sep=",",row.names=TRUE）

head（res）

log2foldchange（MAP）:

conditionCCvsCA

Waldtestp-value:

conditionCCvsCA

DataFramewith6rowsand6columns

baseMeanlog2FoldChangelfcSEstatpvalue

gene1321173.2886810.262679590.20499831.28137422.000623e-01

gene13222.118367-0.052379520.4989589-0.10497769.163936e-01

gene132335.9737010.500545800.30380961.64756419.944215e-02

gene132488.4216610.176776050.24027270.73573094.618945e-01

gene132543.0018280.811431040.29193962.77944865.445127e-03

gene1326662.136259-1.053561050.1752230-6.01268801.824720e-09

padj

gene13213.790396e-01

gene13229.559679e-01

gene13232.337858e-01

gene13246.565731e-01

gene13252.447141e-02

gene13264.520861e-08

∙注：

（1）rownames:

基因ID

（2）baseMean:

所有样本矫正后的平均reads数（3）log2FoldChange:

取log2后的表达量差异（4）pvalue:

统计学差异显著性检验指标（5）padj:

校正后的pvalue,padj越小,表示基因表达差异越显著

∙summary　查看整体分析结果

summary（res）

outof4190withnonzerototalreadcount

adjustedp-value<0.1

LFC>0（up）:

595,14%

LFC<0（down）:

644,15%

outliers[1]:

0,0%

lowcounts[2]:

325,7.8%

（meancount<1）

[1]see'cooksCutoff'argumentof?

results

[2]see'independentFiltering'argumentof?

results

MA图

library（geneplotter）

plotMA（res,main="DESeq2",ylim=c（-2,2））

Heatmap图

sum（res$padj<0.1,na.rm=TRUE）

library（"pheatmap"）

select<-order（rowMeans（counts（dds,normalized=TRUE））,decreasing=TRUE）[1:

1000]

nt<-normTransform（dds）#defaultstolog2（x+1）

log2.norm.counts<-assay（nt）[select,]

df<-as.data.frame（colData（dds）[,c（"name","condition"）]）

pdf（'heatmap1000.pdf',width=6,height=7）

pheatmap（log2.norm.counts,cluster_rows=TRUE,show_rownames=FALSE,

cluster_cols=TRUE,annotation_col=df）

dev.off（）

展开阅读全文