ARAP:
Automated RNA-seq Analysis Pipeline
ARAP is a software tool that generates an analysis pipeline for differential
gene expression analysis of RNA-seq data. The program is written in
Perl. It is suitable for both novice and expert users.
Download:
By downloading the following software, you agree to use it only for the
purpose of reviewing the manuscript.
ARAP.tar
Feedback and Suggestions:
Gao Lab, ray.x.gao_at_gmail*dot*com
Thank you for trying/using our software.
Prerequisites:
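The programs that ARAP calls must be installed before running ARAP. The
steps below name Bowtie 2, TopHat, Cufflinks, Cuffdiff, HTSeq, and the R
packages DESeq, edgeR, and GOSeq.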
We use the human hg19 reference genome as an example to describe how to use
ARAP. Before running ARAP, users need to download three files that
ARAP will use:
(1). The reference genome sequences in FASTA format.
(2). The index files of the reference genome sequences for Bowtie 2.
(3). The transcript annotation file in GTF format.
Users can download these files from the following page, which is provided by TopHat:
http://ccb.jhu.edu/software/tophat/igenomes.shtml
All the files for Homo sapiens UCSC hg19 are available through this link.
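Before moving on, it can help to confirm that the three reference inputs are actually in place. The following is a minimal sketch, not part of ARAP: the function name check_reference_files and every path passed to it are hypothetical. It relies only on the fact that Bowtie 2 index files end in .bt2 (or .bt2l for large genomes).

```shell
# Sketch: verify that the three reference inputs exist and are non-empty.
# check_reference_files is a hypothetical helper, not part of ARAP.
check_reference_files() {
    fasta=$1         # reference genome in FASTA format
    index_prefix=$2  # Bowtie 2 index prefix (file name without .1.bt2 etc.)
    gtf=$3           # transcript annotation in GTF format
    status=0

    [ -s "$fasta" ] || { echo "missing FASTA: $fasta" >&2; status=1; }
    # Bowtie 2 index files end in .bt2, or .bt2l for large genomes.
    if [ ! -s "$index_prefix.1.bt2" ] && [ ! -s "$index_prefix.1.bt2l" ]; then
        echo "missing Bowtie 2 index: $index_prefix" >&2
        status=1
    fi
    [ -s "$gtf" ] || { echo "missing GTF: $gtf" >&2; status=1; }
    return $status
}
```

A call might look like this, with placeholder paths that you replace with the location of your own download:
shell> check_reference_files /path/to/genome.fa /path/to/Bowtie2Index/genome /path/to/genes.gtf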
Installation:
(1). Download the ARAP.tar file from the website.
(2). Extract the file:
shell> tar xvf ARAP.tar
(3). Change into the directory:
shell> cd ARAP/
You will see five files and one folder:
The five files are:
1). ARAP.pl
2). dataTable.txt
3). config.ini
4). cluster.conf
5). MANUAL.txt
The one folder is:
testData
The testData folder contains several small datasets that users can use to test ARAP.
Single vs Cluster Computing:
(1). ARAP can run either on a single Linux machine or on a Linux cluster.
(2). By default, ARAP is configured to run on a
single Linux machine (parallel=0 in the config.ini file).
(3). To run it on a cluster, users need to
set parallel=1 in the config.ini file. Users also need to edit the
cluster.conf file so that it contains the configuration header
that their cluster requires.
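To see which mode a given config.ini selects before generating scripts, a small check like the one below may be useful. It is a sketch under one assumption: that the setting appears on its own line as parallel=0 or parallel=1, as quoted above. The rest of the config.ini format is not described here, and arap_mode is a hypothetical helper name, not part of ARAP.

```shell
# Sketch: report the run mode selected in a config file.
# Assumes a line of the form "parallel=0" or "parallel=1".
arap_mode() {
    config=$1
    # Take the last matching line in case the key appears more than once.
    value=$(sed -n 's/^parallel[[:space:]]*=[[:space:]]*\([01]\).*/\1/p' "$config" | tail -n 1)
    case $value in
        0) echo "single" ;;
        1) echo "cluster" ;;
        *) echo "unknown"; return 1 ;;   # key missing or not 0/1
    esac
}
```

For example:
shell> arap_mode config.ini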
Update the "dataTable.txt" and "config.ini" files:
Before running ARAP, you must update "dataTable.txt" and "config.ini" with values that are relevant to your analysis.
(1). Update the "Directory" field
in dataTable.txt with the path to the directory that contains the raw
sequence data in FASTQ format.
(2). Edit config.ini and make
sure all the paths to programs and reference files are correct. Users
can also specify the parameters for each tool in the config.ini file.
How to run an ARAP analysis:
Command:
shell> cd ARAP
shell> perl ARAP.pl -d $PWD/dataTable.txt -c config.ini -o $PWD
This command generates shell scripts for all samples.
Note: The generated scripts form a series that divides the pipeline
into 4 steps. You can use nohup sh … & to submit these shell scripts.
The 4 steps are submitted as follows:
(1). QC and Mapping:
shell> nohup sh QC_mapping_controlA_controlA_ACCACTG_L001_001.sh &
shell> nohup sh QC_mapping_controlB_controlB_GCAAGAC_L001_001.sh &
shell> nohup sh QC_mapping_thrombinA_thrombinA_AGATCCC_L001_001.sh &
shell> nohup sh QC_mapping_thrombinB_thrombinB_GTGAAGC_L001_001.sh &
shell> nohup sh QC_mapping_thrombinC_thrombinC_GGCTACG_L001_001.sh &
(2). Calculation of FPKM and read count
shell> nohup sh CufflinksHtseq_controlA.sh &
shell> nohup sh CufflinksHtseq_controlB.sh &
shell> nohup sh CufflinksHtseq_thrombinA.sh &
shell> nohup sh CufflinksHtseq_thrombinB.sh &
shell> nohup sh CufflinksHtseq_thrombinC.sh &
(3). Differential gene expression analysis with Cuffdiff
shell> nohup sh Cuffdiff.sh &
(4). Differential gene expression analysis with DESeq and edgeR, and gene ontology and pathway analysis with GOSeq
shell> nohup sh R.sh &
Note: Splitting the pipeline into steps gives users a simple way to
choose which analytical steps to perform. For example, users who only
want to run QC and mapping need to submit only the first step.
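When there are many samples, typing one nohup line per script becomes tedious. The per-sample scripts of a step can be submitted in a loop, as in the sketch below. It assumes single-machine mode and that the scripts sit in the current directory; submit_step is a hypothetical helper, not something ARAP provides. It waits for the step to finish, on the reasoning that a later step reads the output of the earlier one.

```shell
# Sketch: submit every generated script whose name starts with a prefix,
# each with its own log file, then wait for all of them to finish.
# submit_step is a hypothetical helper, not part of ARAP.
submit_step() {
    prefix=$1   # e.g. QC_mapping_ or CufflinksHtseq_
    for script in "$prefix"*.sh; do
        # An unmatched glob is left as literal text, so test that the file exists.
        [ -e "$script" ] || { echo "no scripts match $prefix*.sh" >&2; return 1; }
        nohup sh "$script" > "${script%.sh}.log" 2>&1 &
        echo "submitted $script"
    done
    wait   # returns once all background jobs of this shell have finished
}
```

The first two steps could then be run as:
shell> submit_step QC_mapping_
shell> submit_step CufflinksHtseq_
Note that this starts all samples of a step at once, which may be too much for a machine with limited memory or cores.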
Output files:
ARAP runs in 4 steps, and each step produces the key output files listed below:
(1). QC and Mapping:
1). The QC output files for
each sample are saved in the sub-directory /fastqc of each
sample-specific directory. Users can check this directory for the
quality control results of each sample.
2). The mapping result is
saved in each sample-specific directory. The file name has the form
tophat_*****.bam
(2). Calculation of FPKM and read count:
1). The FPKM output files
are saved in each sample-specific directory.
The file name is genes.fpkm_tracking.
2). The read count output files
are saved in each sample-specific
directory. The file name has the form *****merged.nsorted.count
(3). Differential expression analysis with Cuffdiff:
The results of the differential gene expression
analysis with Cuffdiff are saved in the $PWD directory. The file
name is gene_exp.diff. Users can open it in Excel and identify the
differentially expressed genes using a P value cutoff.
(4). Differential expression analysis with DESeq and edgeR, and gene ontology and pathway analysis:
1). The results of the differential expression
analysis with DESeq are saved in the $PWD directory. The file name
is DESeq_simpleDesignResults.txt. Users can open it in Excel
and identify the differentially expressed genes by their P values.
2). The results of the differential expression
analysis with edgeR are saved in the $PWD directory. The file name
is edgeR_simpleDesignResults.txt. Users can open it in Excel and
identify the differentially expressed genes by their P values.
3). The gene ontology result is
saved in the $PWD directory. The file name is
GO_final_result_sorted.txt. Users can open it in Excel and identify
the differentially expressed genes by their P values. The KEGG IDs are saved
in the $PWD directory. The file name is KEGG_ids.txt.
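As an alternative to Excel, the P value cutoff can be applied on the command line. The sketch below filters a tab-delimited table that has a header row, looking up the P value column by name. The helper filter_by_pvalue is hypothetical. For gene_exp.diff, Cuffdiff names the column p_value. The column names in the DESeq and edgeR result files depend on how ARAP writes them and are not listed here, so check the header of those files first.

```shell
# Sketch: print the header plus every row whose value in the named column
# is below the cutoff. Expects a tab-delimited file with a header row.
# filter_by_pvalue is a hypothetical helper, not part of ARAP.
filter_by_pvalue() {
    file=$1; column=$2; cutoff=$3
    awk -F '\t' -v col="$column" -v cutoff="$cutoff" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Keep numeric values below the cutoff; entries such as NA are skipped.
        $idx ~ /^[0-9.eE+-]+$/ && $idx + 0 < cutoff + 0
    ' "$file"
}
```

For example, to keep the Cuffdiff rows with a P value below 0.05:
shell> filter_by_pvalue gene_exp.diff p_value 0.05 > gene_exp.p05.diff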