1、edgeRDESeq2分析RNAseq差异表达edgeR 包的安装 edgeR 包是基于Bioconductor平台发布的,所以安装不能直接用install.packages()命令从 CRAN 上来下载 安装:# try http:/ if https:/ URLs are not supportedsource(https:/bioconductor.org/biocLite.R)biocLite(edgeR)数据导入 由于 edgeR 对测序结果的下游分析是依赖 count 计数来进行基因差异表达分析的,在这里使用的是featureCounts来进行统计 .bam文件中 Map 的结果
2、count 结果如下:library(edgeR)mydata sampleNames names(mydata)7:12 head(mydata) Geneid Chr Start End Strand Length CA_1 CA_2 CA_3 CC_1 CC_2 CC_31 gene1314 NW_139421.1 1257 1745 + 489 0 0 0 0 0 02 gene1315 NW_139421.1 2115 3452 + 1338 0 0 0 0 0 03 gene1316 NW_139421.1 3856 4680 + 825 0 0 0 0 0 04 gene1317
3、 NW_139421.1 4866 5435 - 570 0 0 0 0 0 05 gene1318 NW_139421.1 6066 6836 - 771 0 0 0 0 0 06 gene1319 NW_139421.1 7294 9483 + 2190 0 0 0 0 0 0 在这里我们只是需要 Geneid 和后 6 列的样本的 count 信息来组成矩阵,所以要处理下countMatrix rownames(countMatrix) head(countMatrix) CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene1314 0 0 0 0 0 0gene1315
4、0 0 0 0 0 0gene1316 0 0 0 0 0 0gene1317 0 0 0 0 0 0gene1318 0 0 0 0 0 0gene1319 0 0 0 0 0 0*要导入的矩阵由3v3样本组成(三组生物学重复)创建 DEGlistgroup y yAn object of class DGEList$counts CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene1314 0 0 0 0 0 0gene1315 0 0 0 0 0 0gene1316 0 0 0 0 0 0gene1317 0 0 0 0 0 0gene1318 0 0 0 0 0 0142
5、12 more rows .$samples group lib.size norm.factorsCA_1 CA_1 1788537 1CA_2 CA_2 1825546 1CA_3 CA_3 1903017 1CC_1 CC_1 1826042 1CC_2 CC_2 2124468 1CC_3 CC_3 2025063 1过滤 过滤掉那些 count 结果都为0的数据,这些没有表达的基因对结果的分析没有用,过滤又两点好处:1 可以减少内存的压力 2 可以减少计算的压力keep 1) = 2y yAn object of class DGEList$counts CA_1 CA_2 CA_3
6、 CC_1 CC_2 CC_3gene1321 161 138 129 218 194 220gene1322 2 3 1 1 3 3gene1323 20 27 33 47 51 46gene1324 60 87 79 86 100 132gene1325 32 29 21 58 75 563877 more rows .$samples group lib.size norm.factorsCA_1 CA_1 1788362 1CA_2 CA_2 1825308 1CA_3 CA_3 1902796 1CC_1 CC_1 1825889 1CC_2 CC_2 2124155 1CC_3 C
7、C_3 2024786 1标准化处理 edgeR采用的是 TMM 方法进行标准化处理,只有标准化处理后的数据才又可比性y yAn object of class DGEList$counts CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene1321 161 138 129 218 194 220gene1322 2 3 1 1 3 3gene1323 20 27 33 47 51 46gene1324 60 87 79 86 100 132gene1325 32 29 21 58 75 563877 more rows .$samples group lib.size nor
8、m.factorsCA_1 CA_1 1788362 0.9553769CA_2 CA_2 1825308 0.9052539CA_3 CA_3 1902796 0.9686232CC_1 CC_1 1825889 0.9923455CC_2 CC_2 2124155 1.1275178CC_3 CC_3 2024786 1.0668754设计矩阵 为什么要一个设计矩阵呢,道理很简单,有了一个设计矩阵才能够更好的分组分析subGroup design rownames(design) design (Intercept) subGroup2 subGroup3 groupCCCA_1 1 0
9、0 0CA_2 1 1 0 0CA_3 1 0 1 0CC_1 1 0 0 1CC_2 1 1 0 1CC_3 1 0 1 1attr(,assign)1 0 1 1 2attr(,contrasts)attr(,contrasts)$subGroup1 contr.treatmentattr(,contrasts)$group1 contr.treatment评估离散度y y$common.dispersion1 0.02683622#plotplotBCV(y)差异表达基因 fit qlf topTags(qlf)Coefficient: groupCC logFC logCPM F PV
10、alue FDRgene7024 -5.515648 9.612809 594.9232 6.431484e-44 2.496702e-40gene6612 5.130282 8.451143 468.2060 1.557517e-39 3.023140e-36gene2743 4.377492 5.586773 208.0268 3.488383e-26 4.513967e-23gene12032 4.734383 5.098148 192.9378 4.359649e-25 4.231040e-22gene491 -2.733910 10.412673 190.9839 6.104188e
11、-25 4.739291e-22gene8941 2.997185 6.839106 177.7614 6.332836e-24 4.097345e-21gene2611 -2.846924 7.216173 174.7332 1.099339e-23 6.096619e-21gene6242 2.529125 9.897771 169.2658 3.022914e-23 1.466869e-20gene7252 3.732315 6.137670 188.2094 3.890569e-23 1.678132e-20gene6125 2.875423 6.569935 160.3189 1.6
12、56083e-22 6.428914e-20查看差异表达基因原始的 CMP top cpm(y)top, CA_1 CA_2 CA_3 CC_1 CC_2 CC_3gene7024 1711.383002 1405.861899 1480.121115 33.11418 37.16040 29.62696gene6612 17.558649 12.103848 26.585753 403.99298 582.45796 1044.35046gene2743 4.682306 1.815577 5.968230 62.91694 87.26431 114.34156gene12032 1.755
13、865 2.420770 2.712832 65.67646 47.59872 75.45617gene491 2811.139727 2059.469669 2222.351938 444.83381 385.38258 253.68087gene8941 23.996820 24.812888 24.415488 131.35291 244.67410 225.90560gene2611 245.821088 310.463691 225.165052 43.04843 26.30455 39.81123gene6242 231.188880 299.570228 298.411515 1
14、348.29899 1343.61988 2191.93237gene7252 9.364613 13.314232 5.425664 92.71970 108.55847 181.92807gene6125 23.411532 14.524617 29.841152 145.70239 160.75005 185.16852查看上调和下调基因的数目 summary(dt isDE DEnames head(DEnames)1 gene1325 gene1326 gene1327 gene1331 gene1340 gene1343差异表达基因画图plotSmear(qlf, de.tags=
15、DEnames)abline(h=c(-1,1), col=blue) DESeq2 包的安装 安装:# try http:/ if https:/ URLs are not supportedsource(https:/bioconductor.org/biocLite.R)biocLite(DESeq2)数据导入 导入count 矩阵,导入数据的方式很多这里直接导入 count 矩阵 count 结果如下:library(DESeq2)sampleNames - c(CA_1,CA_2,CA_3,CC_1,CC_2,CC_3)mydata - read.table(counts.txt,
16、header = TRUE, quote = t,skip =1)names(mydata)7:12 - sampleNamescountMatrix - as.matrix(mydata7:12)rownames(countMatrix) -mydata$Geneidtable2 - data.frame(name = c(CA_1,CA_2,CA_3,CC_1,CC_2,CC_3),condition = (CA,CA,CA,CC,CC,CC)rownames(table2) dds ddsclass: DESeqDataSet dim: 14217 6 metadata(0):assay
17、s(1): countsrownames(14217): gene1314 gene1315 . gene6710 gene6709rowRanges metadata column names(0):colnames(6): CA_1 CA_2 . CC_2 CC_3colData names(2): name condition过滤 过滤掉那些 count 结果都为 0 的数据,这些没有表达的基因对结果的分析没有用dds 1, ddsclass: DESeqDataSet dim: 4190 6 metadata(0):assays(1): countsrownames(4190): ge
18、ne1321 gene1322 . gene6712 gene6710rowRanges metadata column names(0):colnames(6): CA_1 CA_2 . CC_2 CC_3colData names(2): name conditionPCA分析rld - rlog(dds)plotPCA(rld, intgroup=c(name,condition) 当然也可以使用 ggplot2来画PCA 图library(ggplot2)rld - rlog(dds)data - plotPCA(rld, intgroup=c(condition, name), re
19、turnData=TRUE)percentVar - round(100 * attr(data, percentVar)p- ggplot(data, aes(PC1, PC2, color=condition, shape=name) +geom_point(size=3) +xlab(paste0(PC1: ,percentVar1,% variance) +ylab(paste0(PC2: ,percentVar2,% variance)p 注意在进行 PCA 分析前不要library(DESeq)否则无法进行 PCA 分析差异表达基因分析分析结果输出library(DESeq)dds
20、 - DESeq(dds)res - results(dds)write.table(res,result.csv, sep = , row.names = TRUE)head(res)log2 fold change (MAP): condition CC vs CA Wald test p-value: condition CC vs CA DataFrame with 6 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue gene1321 173.288681 0.26267959 0.2049983 1.28137
21、42 2.000623e-01gene1322 2.118367 -0.05237952 0.4989589 -0.1049776 9.163936e-01gene1323 35.973701 0.50054580 0.3038096 1.6475641 9.944215e-02gene1324 88.421661 0.17677605 0.2402727 0.7357309 4.618945e-01gene1325 43.001828 0.81143104 0.2919396 2.7794486 5.445127e-03gene1326 662.136259 -1.05356105 0.17
22、52230 -6.0126880 1.824720e-09 padj gene1321 3.790396e-01gene1322 9.559679e-01gene1323 2.337858e-01gene1324 6.565731e-01gene1325 2.447141e-02gene1326 4.520861e-08 注: (1)rownames: 基因 ID (2)baseMean:所有样本矫正后的平均 reads 数 (3)log2FoldChange:取 log2 后的表达量差异 (4)pvalue:统计学差异显著性检验指标 (5)padj:校正后的 pvalue, padj 越小,
23、表示基因表达差异越显著 summary查看整体分析结果summary(res)out of 4190 with nonzero total read countadjusted p-value 0 (up) : 595, 14% LFC 0 (down) : 644, 15% outliers 1 : 0, 0% low counts 2 : 325, 7.8% (mean count 1)1 see cooksCutoff argument of ?results2 see independentFiltering argument of ?resultsMA 图library(genepl
24、otter)plotMA(res, main=DESeq2, ylim=c(-2,2)Heatmap 图sum(res$padj 0.1, na.rm=TRUE)library(pheatmap)select - order(rowMeans(counts(dds,normalized=TRUE),decreasing=TRUE)1:1000nt - normTransform(dds) # defaults to log2(x+1)log2.norm.counts - assay(nt)select,df - as.data.frame(colData(dds),c(name,condition)pdf(heatmap1000.pdf,width = 6, height = 7)pheatmap(log2.norm.counts, cluster_rows=TRUE, show_rownames=FALSE,cluster_cols=TRUE, annotation_col=df)dev.off()
copyright@ 2008-2022 冰豆网网站版权所有
经营许可证编号:鄂ICP备2022015515号-1