假期姑且总结一下……
1.RNA_seq数据下载
在NCBI GEO
DataSet中选择GSE107744作为所选数据集。该数据为衰老酵母MEP细胞的野生型(wt)、swd1、spp1、set1CD四种类型分别进行YPD 0h/ 7.5h/ 24h/ 48h。芯片从GSM2877741到GSM2877794共计54张芯片。
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107744
获取每张芯片对应的SRR编号

对应到EBI数据库中,获取各SRR对应的ftp

下载并配置Aspera的运行环境
Aspera是高速下载的工具,最快可达到100M/s。
进入Aspera官网的Downloads界面,选中aspera connect
server,点击Wwindows图标,选择v3.6.2版本,点击Download进行下载。
配置Aspera的运行环境
#安装aspera connect |
2.FASTQC质控分析
#利用fastqc分析所有质控 |


图 1 序列质量分布 图 2 序列质量打分


图 3 GC值情况 图 4 N值情况


图 5 序列重复水平 图 6 过表达情况
Trimmomatic序列修剪
for i in `seq 197 254`
do
java -jar ./trimmomatic-0.39.jar SE -threads 6 -phred33 -trimlog SRR6350${i}_trimmomatic.log SRR6350${i}.fastq new_SRR6350${i}.fastq HEADCROP:11
done运行Trimmomatic bash trimmomatic.sh
TopHat2+cufflinks流程
4.1 tophat2序列比对
下载tophat2,解压,将文件软连接到环境变量中
sudo ln -s /mnt/g/RNA_seq/tophat-2.1.1.Linux_x86_64/tophat2
/usr/local/bin/tophat2
for i in `seq 197 254` |
运行tophat nohup ./toptat2.sh &
结果回显

4.2 Cufflinks序列组装
Cufflinks安装,将文件软连接到环境变量中
./configure --prefix=/path/to/cufflinks/install --with-boost=/usr/local/ --with-eigen=/usr/local/include//Eigen/ |
注意,Cufflinks软件安装需要在python2.7条件下使用,因此要切换默认python为2.7
cuffdiff -o diff_out_new -b cufflinks/genome.fa -p 8 -L set1CD_24h1,set1CD_24h2,set1CD_48h3,set1CD_48h4,set1CD_7.5h5,set1CD_7.5h6,set1CD_7.5h7,set1CD_0h8,set1CD_0h9,spp1_24h10,spp1_24h11,spp1_48h12,spp1_48h13,spp1_48h14,spp1_48h15,spp1_48h16,spp1_7.5h17,spp1_7.5h18,spp1_7.5h19,spp1_0h20,spp1_0h21,spp1_0h22,spp1_0h23,spp1_0h24,swd1_24h25,swd1_24h26,swd1_24h27,swd1_48h28,swd1_48h29,swd1_48h30,swd1_48h31,swd1_7.5h32,swd1_7.5h33,swd1_7.5h34,swd1_0h35,swd1_0h36,swd1_0h37,swd1_0h38,wt_24h39,wt_24h40,wt_24h41,wt_24h42,wt_24h43,wt_48h44,wt_48h45,wt_48h46,wt_48h47,wt_48h48,wt_7.5h49,wt_7.5h50,wt_7.5h51,wt_7.5h52,wt_7.5h53,wt_0h54,wt_0h55,wt_0h56,wt_0h57,wt_0h58 -u cufflinks/result.gtf ./tophat2_align/new_SRR6350197_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350198_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350199_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350200_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350201_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350202_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350203_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350204_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350205_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350206_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350207_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350208_thout_new/accepted_hits.bam ……(全部样本) |
---|
大概跑了5个小时后提示segmentation fault(core drump)
猜测是内存不够,只能跑了10个样本
cuffdiff -o diff_out -b cufflinks/genome.fa -p 8 -L C1,C2,C3,C4,C5,T1,T2,T3,T4,T5 -u cufflinks/result.gtf ./tophat2_align/new_SRR6350208_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350209_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350210_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350211_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350212_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350216_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350217_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350218_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350219_thout_new/accepted_hits.bam ./tophat2_align/new_SRR6350220_thout_new/accepted_hits.bam |
---|

从生成文档中选择genes.fpkm_tracking文档

内容回显

此为TopHat2+cufflinks流程的结果文档
5.1 Hisat2+stringtie
安装Hisat2、StringTie,将文件软连接到环境变量中
sudo ln -s /mnt/g/RNA_seq/hisat2-2.1.0-Linux_x86_64/hisat2-2.1.0/hisat2
/usr/local/bin/hisat2
sudo ln -s /mnt/g/RNA_seq/stringtie-1.3.5/stringtie /usr/local/bin/stringtie
for i in `seq 197 254` |
运行 nohup ./hisat2.sh &
5.2 Ballgown
Ballgown是R语言的程序包,在R studio中运行
# devtools::install_github('alyssafrazee/RSkittleBrewer', force=TRUE) |
查看各样本的表达情况
#设置颜色 |

图 7 各样本表达情况
就结果而言,各样本除224外的表达值相差不大
就单个转录本的查看在样品中的分布
ballgown::transcriptNames(bg_filt)[10] |

图 8 TDA8在样品中的分布
剪切方式分析
pdf('plotTranscripts.pdf',width=10, height=10) |

图 9 特定基因在特定样本中的表达情况
6.1 TopHat2 + Cufflinks流程
setwd("G:/RNA_seq/sra/diff_out") |

图 10 cuffdiff-heatmap
6.2 Hisat2+StringTie+Ballgown流程
fpkm<-read.table("fpkm.txt",header=T) |

图 11 Ballgown-heatmap
两组数据都有成功地将各自的差异表达数据类分离开,下面是两组数据所提取的差异表达值
#ballgown SFA1 YCG1 STE14 PAD1 FDC1 KTI11 QDR2 RPI1 KTR7 MPH1 MF(ALPHA)2 ATE1 ECT1 CPR2 IMD2 PRP21 MOG1 STE3 CTK1 CWP1 MUD2 RHO4 LAS1 BNA5 MGR3 TRS130 DSS1 PMS1 SMC5 ATX2 MF(ALPHA)1 SSU1 BTS1 VPS16 TAF3 KAR3 MATALPHA1 SKI3 OPT2 ARR1 CTR86 |
#cuffdiff RPP1A PDC1 RPL38 fba1 RPS15 RPL15A YDR524C-B ADH1 RPS21A RPS29B TPI1 RPS5 ENO2 RPP2A RPS27B RpS28a RPS25A RPS2 RPP2B RPL32 HSP12 RPS3 RPL31A SeD1 RPS12 RPS20 TDH2 RPL27A RPS1B RPS26A RPL3 RPS9B hyp2 RPL6A EFB1 RPP1B RPS22A RPL30 AHP1 RpS10a TDH3 RPS17A RPL26B rpl25 YDR524W-C SPO24 Rps10b RpS14a TMA19 RpL34b rps21b RPS13 GPM1 RPL29 RPS26B |
---|
![http://www.biovenn.nl/0.13014500/biovenn.png](21bb4a1cd1aba0239ba4a7d1cbb6f3ee.png %}
图 12 ballgown-cuffdiff
二者之间并没有交集,但这并不能说明两种方法完全不同。首先,二者进行差异表达所采用的方式不同,ballgown是t检验p<0.001,cuffdiff是t检验p<0.01
z检验p<0.01;其次,二者都是从几千个基因差异表达到几十个,因此出现基因完全不同也比较正常。