ARAP: Automated RNA-seq Analysis Pipeline

ARAP is software that generates an analysis pipeline for differential gene expression analysis of RNA-seq data. The program is written in Perl. It is friendly to both novice and expert users.

Download:
By downloading the following software, you agree that you will NOT use it for any purpose other than reviewing the manuscript.
ARAP.tar

Feedback and Suggestions:
Gao Lab, ray.x.gao_at_gmail*dot*com

Thank you for trying/using our software.

Prerequisites:
The programs and software are required to be installed before running ARAP:

We use the human hg19 reference genome as an example to describe how to use ARAP. Before running ARAP, users need to download three files that ARAP will use:

    (1). The reference genome sequences in FASTA format.
    (2). The index files of the reference genome sequences for Bowtie 2.
    (3). The transcript annotation file in GTF format.

Users can download these files from the link below, which is provided by TopHat:
        http://ccb.jhu.edu/software/tophat/igenomes.shtml

All the files for Homo sapiens UCSC hg19 can be downloaded through this link.
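
Before editing config.ini, it can help to confirm that the three files listed above are in place. The following is only a sketch and is not part of ARAP: the function name check_reference_files is hypothetical, and the paths passed to it depend on where the downloaded archive was unpacked. It assumes the Bowtie 2 index is given as a prefix, with index files named <prefix>.1.bt2 (or <prefix>.1.bt2l for large indexes).

```shell
# Hypothetical helper (not part of ARAP): report whether the three
# reference inputs exist and are non-empty.
# Arguments: FASTA file, Bowtie 2 index prefix, GTF file.
check_reference_files() {
    fasta=$1
    bt2_prefix=$2
    gtf=$3
    rc=0

    if [ ! -s "$fasta" ]; then
        echo "missing FASTA: $fasta"
        rc=1
    fi
    # A Bowtie 2 index is a set of files sharing one prefix. Checking the
    # first file is a quick sanity test, not a full validation.
    if [ ! -s "$bt2_prefix.1.bt2" ] && [ ! -s "$bt2_prefix.1.bt2l" ]; then
        echo "missing Bowtie 2 index: $bt2_prefix"
        rc=1
    fi
    if [ ! -s "$gtf" ]; then
        echo "missing GTF: $gtf"
        rc=1
    fi
    return $rc
}
```

Called with the three paths as arguments, the function prints nothing and returns 0 when all three inputs are found. Otherwise it prints one line per missing input and returns 1.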

Installation:

(1). Download the ARAP.tar file from the website:
(2). Extract the file:
        shell> tar xvf ARAP.tar
        Change into the directory:
        shell> cd ARAP/
    You will see five files and one folder:
        The five files are:
            1). ARAP.pl
            2). dataTable.txt
            3). config.ini
            4). cluster.conf
            5). MANUAL.txt
        The one folder is:
            testData
        The testData folder contains several small datasets that users can use to test ARAP.
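
A quick way to confirm that the archive was extracted completely is to check for the five files and one folder listed above. This is a minimal sketch, not part of ARAP; check_arap_layout is a hypothetical name.

```shell
# Hypothetical helper (not part of ARAP): list any expected item that is
# missing from the extracted ARAP directory (default: current directory).
check_arap_layout() {
    dir=${1:-.}
    missing=0
    for f in ARAP.pl dataTable.txt config.ini cluster.conf MANUAL.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done
    if [ ! -d "$dir/testData" ]; then
        echo "missing folder: testData"
        missing=1
    fi
    return $missing
}
```

Run inside the ARAP/ directory with no arguments, it prints nothing when everything is present.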

Single vs Cluster Computing:

    (1). ARAP can run in both a single Linux environment and a clustered Linux environment.
    (2). By default, ARAP is configured to run in a single Linux environment (parallel=0 in the config.ini file).
    (3). To run it in a cluster environment, users need to set parallel=1 in the config.ini file. Users also need to edit the cluster.conf file so that it contains the correct configuration header for their cluster.
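
The parallel setting can be checked or changed from the command line instead of a text editor. The sketch below assumes that the setting appears in config.ini on its own line, written exactly as parallel=0 or parallel=1 with no surrounding spaces. The function names get_parallel and set_parallel are hypothetical. The cluster.conf header still has to be edited by hand.

```shell
# Hypothetical helpers (not part of ARAP). Both assume a line of the
# form "parallel=<value>" at the start of a line in the config file.

# Print the current value of the parallel setting.
# Usage: get_parallel config.ini
get_parallel() {
    sed -n 's/^parallel=\(.*\)$/\1/p' "$1"
}

# Replace the value of the parallel setting (0 or 1). The file is
# rewritten through a temporary copy; all other lines are left unchanged.
# Usage: set_parallel config.ini 1
set_parallel() {
    sed "s/^parallel=.*$/parallel=$2/" "$1" > "$1.new" && mv "$1.new" "$1"
}
```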

Update the "dataTable.txt" and "config.ini" files:

Before running ARAP, you must update "dataTable.txt" and "config.ini" with values that are relevant for your analysis.
        (1). Update the "Directory" field in dataTable.txt with the path to the directory that contains the raw sequence data in FASTQ format.
        (2). Edit config.ini and make sure all the paths for programs and reference files are correct. Users can also specify the parameters for each tool in the config.ini file.

How to run ARAP analysis:

Command:
    shell> cd ARAP
    shell> perl ARAP.pl -d $PWD/dataTable.txt -c config.ini -o $PWD

This command will generate shell scripts for all samples.

Note: This command generates a series of shell scripts that divide the pipeline into 4 steps. You can use nohup sh …. & to submit these shell scripts.

ARAP runs in 4 steps, which are submitted as shown below:

    (1). QC and Mapping:
        shell> nohup sh QC_mapping_controlA_controlA_ACCACTG_L001_001.sh &
        shell> nohup sh QC_mapping_controlB_controlB_GCAAGAC_L001_001.sh &
        shell> nohup sh QC_mapping_thrombinA_thrombinA_AGATCCC_L001_001.sh &
        shell> nohup sh QC_mapping_thrombinB_thrombinB_GTGAAGC_L001_001.sh &
        shell> nohup sh QC_mapping_thrombinC_thrombinC_GGCTACG_L001_001.sh &

    (2). Calculation of FPKM and read count:

        shell> nohup sh CufflinksHtseq_controlA.sh &
        shell> nohup sh CufflinksHtseq_controlB.sh &
        shell> nohup sh CufflinksHtseq_thrombinA.sh &
        shell> nohup sh CufflinksHtseq_thrombinB.sh &
        shell> nohup sh CufflinksHtseq_thrombinC.sh &

    (3). Differential gene expression analysis with Cuffdiff:
        shell> nohup sh Cuffdiff.sh &

    (4). Differential gene expression analysis with DESeq and edgeR, and gene ontology and pathway analysis with GOSeq:

        shell> nohup sh R.sh &

    Note: Splitting the pipeline into steps gives users a simple way to select which analytical steps to perform. For example, users who only want to run QC and mapping need to submit only the first step.
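
When there are many samples, the per-sample scripts of one step can be submitted in a loop rather than one by one. The sketch below is one possible way to do this and is not part of ARAP: run_step is a hypothetical helper, and the log file naming is an assumption. It starts every script matching a pattern in the background with nohup, then waits for all of them, so the next step can be started only after the previous one has finished. All matching scripts run at the same time, so on a single machine the number of samples should fit the available CPU and memory.

```shell
# Hypothetical helper (not part of ARAP): run every script matching a
# glob pattern in the background, then block until all have exited.
# Each script's output goes to a log file named after the script.
# Usage: run_step 'QC_mapping_*.sh'
run_step() {
    pattern=$1
    started=0
    # $pattern is deliberately unquoted so the shell expands the glob.
    for script in $pattern; do
        # An unmatched glob is left as the literal pattern; skip it.
        [ -f "$script" ] || continue
        nohup sh "$script" > "${script%.sh}.log" 2>&1 &
        started=$((started + 1))
    done
    # Wait for all background scripts started above.
    wait
    echo "finished $started script(s) matching $pattern"
}
```

For example, run_step 'QC_mapping_*.sh' followed by run_step 'CufflinksHtseq_*.sh' runs the first two steps in order. The helper does not check exit codes, so the log files should still be inspected for errors.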

Output files:

ARAP runs in 4 steps, and each step produces the following main output files:

    (1). QC and Mapping:
        1). The QC output files for each sample are saved in the sub-directory /fastqc of each sample-specific directory. Users can check this directory for the quality control results of each sample.
        2). The mapping result is saved in each sample-specific directory. The file name has the form tophat_*****.bam

    (2). Calculation of FPKM and read count:
        1). The FPKM output is saved in each sample-specific directory. The file name is genes.fpkm_tracking.
        2). The read count output is saved in each sample-specific directory. The file name has the form *****merged.nsorted.count

    (3). Differential expression analysis with Cuffdiff:
        The results of the differential gene expression analysis with Cuffdiff are saved in the $PWD directory. The file name is gene_exp.diff. Users can open it in Excel and identify the differentially expressed genes using a P value cutoff.

    (4). Differential expression analysis with DESeq and edgeR, and gene ontology and pathway analysis:
        1). The results of the differential expression analysis with DESeq are saved in the $PWD directory. The file name is DESeq_simpleDesignResults.txt. Users can open it in Excel and identify the differentially expressed genes using the P values.
        2). The results of the differential expression analysis with edgeR are saved in the $PWD directory. The file name is edgeR_simpleDesignResults.txt. Users can open it in Excel and identify the differentially expressed genes using the P values.
        3). The gene ontology result is saved in the $PWD directory. The file name is GO_final_result_sorted.txt. Users can open it in Excel and use the P values to interpret the results. The KEGG IDs are saved in the $PWD directory. The file name is KEGG_ids.txt.
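
The P value cutoff described for gene_exp.diff can also be applied on the command line instead of in Excel. The sketch below is not part of ARAP, and filter_by_pvalue is a hypothetical helper. It assumes a tab-delimited file with a header line and finds the P value column by name. In Cuffdiff's gene_exp.diff this column is named p_value. The column names in the DESeq and edgeR result files depend on how ARAP writes them, so check the header line before using the helper on those files.

```shell
# Hypothetical helper (not part of ARAP): print the header plus all rows
# of a tab-delimited file whose value in the named column is numeric and
# at or below a cutoff.
# Usage: filter_by_pvalue gene_exp.diff p_value 0.05
filter_by_pvalue() {
    awk -F '\t' -v col="$2" -v cutoff="$3" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Skip non-numeric entries such as NA; adding 0 forces a
        # numeric comparison.
        $idx ~ /^[0-9.eE+-]+$/ && ($idx + 0) <= (cutoff + 0) { print }
    ' "$1"
}
```

The filtered rows can be redirected to a new file, which can then be opened in Excel as before. The cutoff 0.05 in the usage line is only an example value.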