RNA_seq流程

假期姑且总结一下……

1.RNA_seq数据下载

在NCBI GEO
DataSet中选择GSE107744作为所选数据集。该数据为衰老酵母MEP细胞的野生型(wt)、swd1、spp1、set1CD四种类型分别进行YPD 0h/ 7.5h/ 24h/ 48h。芯片从GSM2877741到GSM2877794共计54张芯片。

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107744

获取每张芯片对应的SRR编号

对应到EBI数据库中,获取各SRR对应的ftp

下载并配置Aspera的运行环境

Aspera是高速下载的工具,最快可达到100M/s。

进入Aspera官网的Downloads界面,选中aspera connect
server,点击Wwindows图标,选择v3.6.2版本,点击Download进行下载。

配置Aspera的运行环境

#安装aspera connect
tar zxf aspera-connect-3.5.1.92523-linux-64.tar.gz
$ sh aspera-connect-3.5.1.92523-linux-64.sh
#将aspera运行文件复制到运行路径(或者直接添加环境变量)
cd #进入根目录
ls -a #检查是否已安装好.aspera
.aspera/ect/
#进入etc目录
cp -r etc/. /mnt/
#开放端口
#进入root
su
iptables -A INPUT -p udp --dport 33001 -j ACCEPT
iptables -A OUTPUT -p udp --dport 33001 -j ACCEPT
iptables -L -n
exit

#下载方法(以SRR6350254为例)
ascp -QT -l 500M -P 33001 -i asperaweb_id_dsa.openssh era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR635/004/SRR6350254/SRR6350254.fastq.gz /mnt/sra

2.FASTQC质控分析

#利用fastqc分析所有质控
fastqc new_SRR6350254.fastq
#用multiqc工具总结
bash Anaconda2-5.3.1-Linux-x86_64.sh
conda install -c bioconda multiqc
multiqc --help
multiqc . #在包含所有fastqc.zip的文件夹中运行

图 1 序列质量分布 图 2 序列质量打分

图 3 GC值情况 图 4 N值情况

图 5 序列重复水平 图 6 过表达情况

  1. Trimmomatic序列修剪

    for i in `seq 197 254`
    do
    java -jar ./trimmomatic-0.39.jar SE -threads 6 -phred33 -trimlog SRR6350${i}_trimmomatic.log SRR6350${i}.fastq new_SRR6350${i}.fastq HEADCROP:11
    done

运行Trimmomatic bash trimmomatic.sh

  1. TopHat2+cufflinks流程

4.1 tophat2序列比对

下载tophat2,解压,将文件软连接到环境变量中

sudo ln -s /mnt/g/RNA_seq/tophat-2.1.1.Linux_x86_64/tophat2
/usr/local/bin/tophat2

for i in `seq 197 254`
do
tophat2 -p 4 -G /mnt/g/RNA_seq/sra/tophat2_align/Saccharomyces_cerevisiae.R64-1-1.86.gtf -o tophat2_align/new_SRR6350${i}_thout_new/ /mnt/g/RNA_seq/sra/bowtie2_align/Bowtie2Index/genome /mnt/g/RNA_seq/sra/new_SRR6350${i}.fastq >new_SRR6350${i}_tophat.log
done

运行tophat nohup ./toptat2.sh &

结果回显

4.2 Cufflinks序列组装

Cufflinks安装,将文件软连接到环境变量中

./configure --prefix=/path/to/cufflinks/install --with-boost=/usr/local/ --with-eigen=/usr/local/include//Eigen/
make
make install

注意,Cufflinks软件安装需要在python2.7条件下使用,因此要切换默认python为2.7

cuffdiff -o diff_out_new -b cufflinks/genome.fa -p 8 -L set1CD_24h1,set1CD_24h2,set1CD_48h3,set1CD_48h4,set1CD_7.5h5,set1CD_7.5h6,set1CD_7.5h7,set1CD_0h8,set1CD_0h9,spp1_24h10,spp1_24h11,spp1_48h12,spp1_48h13,spp1_48h14,spp1_48h15,spp1_48h16,spp1_7.5h17,spp1_7.5h18,spp1_7.5h19,spp1_0h20,spp1_0h21,spp1_0h22,spp1_0h23,spp1_0h24,swd1_24h25,swd1_24h26,swd1_24h27,swd1_48h28,swd1_48h29,swd1_48h30,swd1_48h31,swd1_7.5h32,swd1_7.5h33,swd1_7.5h34,swd1_0h35,swd1_0h36,swd1_0h37,swd1_0h38,wt_24h39,wt_24h40,wt_24h41,wt_24h42,wt_24h43,wt_48h44,wt_48h45,wt_48h46,wt_48h47,wt_48h48,wt_7.5h49,wt_7.5h50,wt_7.5h51,wt_7.5h52,wt_7.5h53,wt_0h54,wt_0h55,wt_0h56,wt_0h57,wt_0h58 -u cufflinks/result.gtf ./tophat2_align/new_SRR6350197_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350198_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350199_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350200_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350201_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350202_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350203_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350204_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350205_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350206_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350207_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350208_thout_new/accepted_hits.bam ……(全部样本)

大概跑了5个小时后提示segmentation fault(core drump)

猜测是内存不够,只能跑了10个样本

cuffdiff -o diff_out -b cufflinks/genome.fa -p 8 -L C1,C2,C3,C4,C5,T1,T2,T3,T4,T5 -u cufflinks/result.gtf ./tophat2_align/new_SRR6350208_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350209_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350210_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350211_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350212_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350216_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350217_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350218_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350219_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350220_thout_new/accepted_hits.bam

从生成文档中选择genes.fpkm_tracking文档

内容回显

此为TopHat2+cufflinks流程的结果文档

  1. Hisat2+StringTie+Ballgown流程

5.1 Hisat2+stringtie

安装Hisat2、StringTie,将文件软连接到环境变量中

sudo ln -s /mnt/g/RNA_seq/hisat2-2.1.0-Linux_x86_64/hisat2-2.1.0/hisat2
/usr/local/bin/hisat2

sudo ln -s /mnt/g/RNA_seq/stringtie-1.3.5/stringtie /usr/local/bin/stringtie

for i in `seq 197 254`
do
hisat2 -p 4 -x /mnt/g/RNA_seq/sra/hisat2_align/r64_tran/genome_tran -U /mnt/g/RNA_seq/sra/new_SRR6350${i}.fastq -S /mnt/g/RNA_seq/sra/hisat2_align/hisat2/hisat2_SRR6350${i}.sam
cd hisat2_align/hisat2
samtools view -S hisat2_SRR6350${i}.sam -b > hisat2_SRR6350${i}.bam
samtools sort hisat2_SRR6350${i}.bam hisat2_SRR6350${i}_sorted
stringtie hisat2_SRR6350${i}_sorted.bam -p 4 -G /mnt/g/RNA_seq/sra/hisat2_align/stringtie/Saccharomyces_cerevisiae.R64-1-1.86.gtf -o stringtie/new_SRR6350${i}.gtf
done

运行 nohup ./hisat2.sh &

5.2 Ballgown

Ballgown是R语言的程序包,在R studio中运行

# devtools::install_github('alyssafrazee/RSkittleBrewer', force=TRUE)
library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(devtools)
setwd("G:/RNA_seq/sra")
pheno_data <- read.table("ballgown/phenodata.txt",header = T)
bg = ballgown(dataDir = "ballgown", samplePattern = "SRR", pData=pheno_data)
#滤掉了样本间差异少于一个转录本的数据
bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE)
result_transcripts=stattest(bg_filt,feature = "transcript",covariate = "celllines",adjustvars = c("time"), getFC=TRUE,meas="FPKM")
result_genes=stattest(bg_filt,feature = "gene",covariate = "celllines",adjustvars = c("time"), getFC=TRUE,meas="FPKM")
result_transcripts=data.frame(geneNames=ballgown::geneNames(bg_filt),geneIDs=ballgown::geneIDs(bg_filt),result_transcripts)
result_transcripts=arrange(result_transcripts,pval)
result_genes=arrange(result_genes,pval)
#把结果写入csv文件
write.csv(result_transcripts, "transcript_results.csv",row.names=FALSE)
write.csv(result_genes, "gene_results.csv",row.names=FALSE)
data <- subset(result_transcripts,result_transcripts$qval<0.001)
gene <- subset(result_genes,result_genes$qval<0.001)
dim(gene)
# 提取FPKM值
#方便作图将其log转换fpkm = texpr(bg_filt,meas="FPKM")
fpkm = log2(fpkm+1)
write.table(fpkm,"fpkm.txt",quote=FALSE)

查看各样本的表达情况

#设置颜色
tropical= c('darkorange', 'dodgerblue','hotpink', 'limegreen', 'yellow')
palette(tropical)
setwd("G:/RNA_seq/sra/ballgown")
pdf('boxplot.pdf',width=10, height=10)
boxplot(fpkm,col=as.numeric(pheno_data$time),las=2,ylab='log2(FPKM+1)')
dev.off()

图 7 各样本表达情况

就结果而言,各样本除224外的表达值相差不大

就单个转录本的查看在样品中的分布

ballgown::transcriptNames(bg_filt)[10]
ballgown::geneNames(bg_filt)[10]
str = paste(ballgown::geneNames(bg_filt)[10],' : ',ballgown::transcriptNames(bg_filt)[10])
str
# TDA8:YAL064C-A
pdf('fpkm_boxplot.pdf',width=10, height=10)
boxplot(fpkm[12,] ~ pheno_data$celllines, border=c(1,2), main="TDA8:YAL064C-A", pch=19, xlab="celllines", ylab='log2(FPKM+1)')
points(fpkm[12,] ~ jitter(as.numeric(pheno_data$celllines)),col=as.numeric(pheno_data$celllines))
dev.off()

图 8 TDA8在样品中的分布

剪切方式分析

pdf('plotTranscripts.pdf',width=10, height=10)
plotTranscripts(ballgown::geneIDs(bg_filt)[1679], bg_filt, main=c('Gene MSTRG.1517.3 in sample SRR6350197'), sample=c(' SRR6350197'))
dev.off()
#以处理方式为区分查看基因表达情况
png('plotMeans.png')
plotMeans('MSTRG.1517', bg_filt,groupvar="celllines",legend=FALSE)
dev.off()

图 9 特定基因在特定样本中的表达情况

  1. 差异表达+聚类分析

6.1 TopHat2 + Cufflinks流程

setwd("G:/RNA_seq/sra/diff_out")
gene<-read.table("using_fpkm.txt",header=T)
gene0<-gene
row.names(gene)<-gene[,1]
gene<-gene[,-1]

p_value<-c()
t.test()
spp4<-gene[,c(1,2,3,4,5)]
spp0<-gene[,c(6,7,8,9,10)]

for(i in 1:nrow(gene)){
p_value[i]<-t.test(spp4[i,],spp0[i,])$p.value
}
gene$p_value<-p_value
gene_p_005 <- gene[p_value<0.05,]

gene_p_001 <- gene[p_value<0.01,]

#倍数法
spp4_avg<-apply(spp4,1,mean)
spp0_avg<-apply(spp0,1,mean)
gene$spp4_avg<-spp4_avg
gene$spp0_avg<-spp0_avg
gene_fold_2 <- gene[2^(log2(spp4_avg/spp0_avg))>2,]

gene_fold_3 <- gene[2^(log2(spp4_avg/spp0_avg))>3,]

#筛选
#t 0.01 倍数 2
gene_t005_fold2<-gene_fold_2[gene_fold_2$p_value<0.05,]

gene$avg_cha<-abs(spp4_avg-spp0_avg)
gene_z<-vector(length=nrow(gene))
for(i in 1:nrow(gene)){
gene_z[i]<-(gene$avg_cha[i]-mean(gene$avg_cha))/sd(gene$avg_cha)
}
gene$gene_z<-abs(gene_z)


gene_zscore_196<-gene[gene$gene_z>1.96,]
# gene_t001_z005<-gene[gene_fold_2$gene_z>1.96,]
gene_zscore_258<-gene[gene$gene_z>2.58,]

gene_t001_z001<-gene_zscore_258[gene_zscore_258$p_value<0.01,]

gene_t001_fold2_z005<-gene_t001_fold2[gene_t001_fold2$gene_z>1.96,]


gene_chayi<-gene_t001_z001[,-11]
gene_chayi<-gene_chayi[,-11]
gene_chayi<-gene_chayi[,-11]
gene_chayi<-gene_chayi[,-11]
gene_chayi<-gene_chayi[,-11]
treats<-c("48h","48h","48h","48h","48h","0h","0h","0h","0h","0h")
library(ggplot2)
library(pheatmap)
heatmap<-function(data,mainname,path){
pdf(path,onefile = FALSE,width=10, height=10) #保存地址(save PATH),onefile删除了第一页的空白
annotation_col = data.frame(Type = factor(treats))
rownames(annotation_col) = paste(colnames(data)) #构建annotation_col
#绘制heatmap(Drawing the heatmap)
pheatmap(data,
#cluster_col = FALSE,
scale="row", #按行聚类(Clustering by row)
treeheight_row=100,treeheight_col=20, #去除聚类树(Remove the clustering tree)
#加颜色(color)
color=colorRampPalette(c('green','black','red'))(50),
show_rownames=F, #去行名(Remove rowname)
annotation_col = annotation_col, #将轨道赋给heatmap(Assign orbits to heatmap)
main=mainname,
border_color=NA
)
dev.off()
}
heatmap(data=gene_chayi,mainname="Heatmap Of t.test p<005 fold3",path="E:/生信/大三/转录组学/heatmap_t001_fold2.pdf")

图 10 cuffdiff-heatmap

6.2 Hisat2+StringTie+Ballgown流程

fpkm<-read.table("fpkm.txt",header=T)
gene<-cbind(fpkm[,10:24],fpkm[,39:58])
gene<-as.data.frame(gene)

p_value<-vector(length=nrow(gene))
t.test()
nwt<-gene[,c(1:14)]
wt<-gene[,c(15:34)]
for(i in 1:nrow(gene)){
p_value[i]<-t.test(nwt[i,],wt[i,])$p.value
}
gene$p_value<-p_value
gene_p_005 <- gene[p_value<0.05,]
gene_p_0001 <- gene[p_value<0.001,]
gene_chayi<-gene_p_0001[,-36]
gene_chayi<-gene_chayi[,-36]
gene_chayi<-gene_chayi[,-36]
gene_chayi<-gene_chayi[,-36]
gene_chayi<-gene_chayi[,-36]
treats<-c("spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1","spp1",
"wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt","wt")
library(ggplot2)
library(pheatmap)
heatmap<-function(data,mainname,path){
pdf(path,onefile = FALSE,width=10, height=10) #保存地址(save PATH),onefile删除了第一页的空白
annotation_col = data.frame(Type = factor(treats))
rownames(annotation_col) = paste(colnames(data)) #构建annotation_col
#绘制heatmap(Drawing the heatmap)
pheatmap(data,
#cluster_col = FALSE,
scale="row", #按行聚类(Clustering by row)
treeheight_row=100,treeheight_col=20, #去除聚类树(Remove the clustering tree)
#加颜色(color)
color=colorRampPalette(c('green','black','red'))(50),
show_rownames=F, #去行名(Remove rowname)
annotation_col = annotation_col, #将轨道赋给heatmap(Assign orbits to heatmap)
main=mainname,
border_color=NA
)
dev.off()
}
heatmap(data=gene_chayi,mainname="Heatmap Of t.test p<0.001",path="E:/生信/大三/转录组学/heatmap_ballgown.pdf")

图 11 Ballgown-heatmap

两组数据都有成功地将各自的差异表达数据类分离开,下面是两组数据所提取的差异表达值

#ballgown
SFA1
YCG1
STE14
PAD1
FDC1
KTI11
QDR2
RPI1
KTR7
MPH1
MF(ALPHA)2
ATE1
ECT1
CPR2
IMD2
PRP21
MOG1
STE3
CTK1
CWP1
MUD2
RHO4
LAS1
BNA5
MGR3
TRS130
DSS1
PMS1
SMC5
ATX2
MF(ALPHA)1
SSU1
BTS1
VPS16
TAF3
KAR3
MATALPHA1
SKI3
OPT2
ARR1
CTR86

#cuffdiff
RPP1A
PDC1
RPL38
fba1
RPS15
RPL15A
YDR524C-B
ADH1
RPS21A
RPS29B
TPI1
RPS5
ENO2
RPP2A
RPS27B
RpS28a
RPS25A
RPS2
RPP2B
RPL32
HSP12
RPS3
RPL31A
SeD1
RPS12
RPS20
TDH2
RPL27A
RPS1B
RPS26A
RPL3
RPS9B
hyp2
RPL6A
EFB1
RPP1B
RPS22A
RPL30
AHP1
RpS10a
TDH3
RPS17A
RPL26B
rpl25
YDR524W-C
SPO24
Rps10b
RpS14a
TMA19
RpL34b
rps21b
RPS13
GPM1
RPL29
RPS26B

![http://www.biovenn.nl/0.13014500/biovenn.png](21bb4a1cd1aba0239ba4a7d1cbb6f3ee.png %}

图 12 ballgown-cuffdiff

二者之间并没有交集,但这并不能说明两种方法完全不同。首先,二者进行差异表达所采用的方式不同,ballgown是t检验p<0.001,cuffdiff是t检验p<0.01
z检验p<0.01;其次,二者都是从几千个基因差异表达到几十个,因此出现基因完全不同也比较正常。