TITAN – Running TITANRunner pipeline


IMPORTANT: TITANRunner has been deprecated and replaced with a KRONOS-based workflow. Please visit https://github.com/MO-BCCRC/titan_workflow for more details.
TITANRunner is a Python ruffus pipeline that performs all steps in the analysis, including parsing BAM files, generating input files for TitanCNA analysis, and generating output files and figures.

Users who wish to customize the TitanCNA R analysis can do so through this pipeline or by writing their own R scripts (see Details for TitanCNA R package).

IMPORTANT: TITANRunner can only be launched from the head node of a grid engine (cluster).

Please make sure all dependencies are installed. See the Installation page.

Pipeline workflow description

TITANRunner accepts a list of tumour-normal pairs of BAM files. The first 3 steps below generate the 2 input files for analysis with TitanCNA. TITANRunner then runs TitanCNA, writing results to flat files and plot images.

  1. Identify germline heterozygous SNP positions in the matched normal BAM file.
  2. Extract the tumour allele read counts from the tumour BAM file at each of the germline heterozygous SNPs from Step 1. (Generates input file #1)
  3. Extract the tumour read depth from the tumour BAM file using HMMcopy suite. Correct GC content and mappability biases using HMMcopy R package. (Generates input file #2)
  4. Run TitanCNA, including generating figures for chromosome plots.

TITANRunner pipeline arguments

python TitanRunner.py -help
usage: TitanRunner.py [-h] -i INFILE --project-name PROJECT_NAME
                      --project-path PROJECT_PATH -c CONFIGFILE
                      [-r CUSTOM_RUN_ID] [--version]
                      [--platform {illumina,solid}]

TitanRunner v0.0.2

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        6 column (tab delimited) input file: tumourID
                        tumour_libID tumour_path normalID normal_libID
                        normal_path
  --project-name PROJECT_NAME
                        name of the project
  --project-path PROJECT_PATH
                        specify results directory
  -r CUSTOM_RUN_ID
                        manually enter run_id (used to re-launch a run)
  --version             show program's version number and exit
  --platform {illumina,solid}
                        set tech platform (default is illumina)

Launching the TITANRunner pipeline

An example command to launch the pipeline:

python TitanRunner.py -i my_tumour_normal_pairs.txt -c config_default.cfg --project-name my_samples --project-path /path/to/my/runner_results/


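INFILE (required)
my_tumour_normal_pairs.txt is a 6-column, tab-delimited input file with the columns tumourID, tumour_libID, tumour_path, normalID, normal_libID and normal_path. tumour_path and normal_path are the paths to the tumour and normal BAM files, respectively. tumourID and tumour_libID identify the tumour sample; normalID and normal_libID identify the normal sample. The same identifier may be used for both the sample ID and the library ID if there is no separate library ID.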
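
As an illustration, the following Python sketch builds the contents of such a file. The column order follows the help text above, with normal_path assumed to be the sixth column. The absence of a header row is also an assumption, and the sample IDs and BAM paths are hypothetical placeholders.

```python
import csv
import io

# Column order taken from the --infile help text; normal_path as the
# sixth column is an assumption based on the INFILE description.
COLUMNS = ["tumourID", "tumour_libID", "tumour_path",
           "normalID", "normal_libID", "normal_path"]


def format_pairs(pairs):
    """Return tab-delimited text with one tumour-normal pair per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for pair in pairs:
        if len(pair) != len(COLUMNS):
            raise ValueError("expected %d fields, got %d"
                             % (len(COLUMNS), len(pair)))
        if any((not field) or ("\t" in field) for field in pair):
            raise ValueError("fields must be non-empty and contain no tabs")
        writer.writerow(pair)
    return buf.getvalue()


# Hypothetical sample; the library ID reuses the sample ID.
pairs = [("TUM01", "TUM01", "/path/to/TUM01.bam",
          "NORM01", "NORM01", "/path/to/NORM01.bam")]
infile_text = format_pairs(pairs)
```

Writing infile_text to my_tumour_normal_pairs.txt gives a file that can be passed to -i.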

PROJECT_NAME (required)
A string denoting the project name.

PROJECT_PATH (required)
Full path specifying the location the results will be written to.

CUSTOM_RUN_ID (optional)
When this argument is not used, the runID is generated automatically for each run and is stored along with the project name in runID_database.txt. The runID will appear in the results directory name, TITAN results files, and the run log file.
Use this argument to re-run the pipeline after a failed run: specify the same runID that was used previously. The pipeline will pick up where it was interrupted and continue to completion without re-running intermediate steps that have already completed.
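
The launch and re-launch commands differ only in the -r argument. The sketch below assembles the command line from the options listed in the usage message. The helper name build_command and the run ID "run_0001" are hypothetical; the real run ID should be taken from runID_database.txt.

```python
import shlex


def build_command(infile, configfile, project_name, project_path,
                  run_id=None, platform=None):
    """Assemble the TitanRunner.py command line as a list of arguments."""
    cmd = ["python", "TitanRunner.py",
           "-i", infile,
           "-c", configfile,
           "--project-name", project_name,
           "--project-path", project_path]
    if run_id is not None:
        cmd += ["-r", run_id]            # re-launch an earlier run
    if platform is not None:
        cmd += ["--platform", platform]  # illumina or solid
    return cmd


# "run_0001" is a made-up run ID used only for illustration.
cmd = build_command("my_tumour_normal_pairs.txt", "config_default.cfg",
                    "my_samples", "/path/to/my/runner_results/",
                    run_id="run_0001")
command_line = " ".join(shlex.quote(arg) for arg in cmd)
```

The resulting list can be passed to subprocess.call on the head node, or command_line can be pasted into a shell.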

CONFIGFILE (required)
See the next section.

Details for setting up the CONFIGFILE

The settings for running the pipeline (including preprocessing and the TitanCNA R portion) are specified in the CONFIGFILE. A template is included in the TITANRunner distribution (config_default.cfg).

Each section of the CONFIGFILE is explained in more detail below.

a. Cluster configurations

TITANRunner requires a grid engine/cluster setup. These settings depend on the grid engine software used.
Users can also specify here the number of cores (within a node/machine) to use for parallelization within the TitanCNA R component of the pipeline.

max_jobs = 500
qsub_statement = qsub -q all.q -sync yes -j yes -o {0} -now n -b yes -pe serial {1} -l mem_free={2} -V
# memory requirements
titan_mem_high = 5G
titan_mem_med = 2G
titan_mem_low = 500M
# number of cores to utilize on each node
titan_cores = 4
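
The qsub_statement contains three positional placeholders. The Python sketch below shows how such a template can be filled with str.format. The meaning of the placeholders ({0} log file, {1} number of slots, {2} memory request) is inferred from the qsub options they follow, not confirmed by the TITANRunner source, and the log path is a hypothetical example.

```python
# Template copied from the cluster section of the configuration file.
qsub_statement = ("qsub -q all.q -sync yes -j yes -o {0} -now n -b yes "
                  "-pe serial {1} -l mem_free={2} -V")

titan_cores = 4
titan_mem_high = "5G"

# Assumed placeholder meaning: {0} log file, {1} slots, {2} memory.
# The log path below is a hypothetical example.
qsub_command = qsub_statement.format("/path/to/logs/titan.log",
                                     titan_cores, titan_mem_high)
```

When adapting qsub_statement to another queue or parallel environment, keep all three placeholders so the template can still be filled in.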

b. Software paths

Specify the paths to the 3 required tools: R, samtools, and bcftools. These dependencies should be installed prior to running TITANRunner (see Installation).

R = /path/to/R/executable/bin/R
samtools = /path/to/samtools/executable/samtools
bcftools = /path/to/bcftools/executable/bcftools

c. Reference genome specific files

  1. reference_path specifies the reference genome used to align the tumour and normal sequenced reads.
  2. map_path and gc_path specify the reference mappability score and GC content files, respectively, that will be used to normalize the tumour and normal read data. See the HMMcopy page for more details on how to generate these files, which are specific to the reference genome used to align the samples. We also provide these files for GRCh37 (hg19), divided into 1kb windows. See the Downloads page.
  3. dbsnp_path specifies a dbSNP VCF file of known SNP positions.

reference_path = /path/to/reference/genome/genome.fasta
map_path = /path/to/genome/map/wig/genome.map.wig
gc_path = /path/to/genome/gc/wig/genome.gc.wig
dbsnp_path = /path/to/dbsnp/reference/common_all_dbSNP138.vcf.gz
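
Because a wrong path typically only surfaces once a cluster job fails, it can help to check the configured paths before launching. The following is a minimal sketch, assuming the CONFIGFILE consists of "key = value" lines with "#" comments and possibly "[section]" headers. The helper names are hypothetical and not part of TITANRunner.

```python
import os


def read_settings(text):
    """Parse 'key = value' lines, skipping blanks, comments and [section] headers."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_paths(settings, keys):
    """Return the keys whose value is absent or does not exist on disk."""
    return [k for k in keys if not os.path.exists(settings.get(k, ""))]


PATH_KEYS = ["R", "samtools", "bcftools",
             "reference_path", "map_path", "gc_path", "dbsnp_path"]

example = """
reference_path = /path/to/reference/genome/genome.fasta
map_path = /path/to/genome/map/wig/genome.map.wig
"""
settings = read_settings(example)
problems = missing_paths(settings, PATH_KEYS)
```

Here problems lists every key that is unset or points to a non-existent file; an empty list means all seven paths were found.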

d. TitanCNA parameters

TITANRunner provides users the ability to change TitanCNA parameters directly in the CONFIGFILE. These parameters are passed to the underlying R script (titan-runner/scripts/titan.R) by TITANRunner when performing the TitanCNA analysis step.

Note: Advanced R users can use TitanCNA independently of TITANRunner and modify parameters directly in an R session or in their own scripts. See the Details for running TitanCNA R package page.

# Maximum number of clonal clusters; default 5
# (i.e. runs TITAN 5 times, once each for 1 to 5 clusters)
num_clusters = 5
– This parameter specifies the maximum number of clonal clusters. TITANRunner will run the TitanCNA titan.R script multiple times. For example, with a maximum of 5 clusters (default), TitanCNA is run 5 times, once for each fixed number of clusters from 1 to 5.

# normal_params_n0 can be in the range of [0,1] inclusive
normal_params_n0 = 0.5
– Initialization of normal contamination parameter.

# normal_estimate_method can have values {fixed, map}
normal_estimate_method = map
– Specify whether to estimate the normal contamination parameter using maximum a posteriori estimation (map) or to leave it fixed at normal_params_n0 (fixed).

# initialize 2 for diploid; 4 for tetraploid
titan_ploidy = 2
estimate_ploidy = TRUE
– titan_ploidy is the initialization of the average tumour ploidy parameter. We recommend running TitanCNA for both diploid (2) and tetraploid (4) settings. estimate_ploidy specifies whether to estimate the ploidy parameter (TRUE) or leave it fixed (FALSE).

# maxCN value should be [5,8]
maxCN = 8
– Specify the maximum copy number to consider in the analysis. When running TitanCNA initialized with the tetraploid setting (titan_ploidy=4), maxCN=8 should be used. When running with the diploid initialization (titan_ploidy=2), users can use maxCN=5 to reduce the state space and overall model complexity.
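
Taken together, the recommendations for num_clusters, titan_ploidy and maxCN imply a small grid of TitanCNA runs per sample. The sketch below enumerates that grid. It assumes the pipeline is launched once per ploidy setting (each with its own CONFIGFILE) and that maxCN=5 is chosen for the diploid runs; the function name is hypothetical.

```python
def planned_runs(num_clusters=5, ploidies=(2, 4)):
    """List (titan_ploidy, maxCN, clusters) for each TitanCNA run."""
    runs = []
    for ploidy in ploidies:
        # maxCN=8 for tetraploid; maxCN=5 is the optional reduced
        # setting for diploid initialization.
        max_cn = 8 if ploidy == 4 else 5
        for clusters in range(1, num_clusters + 1):
            runs.append((ploidy, max_cn, clusters))
    return runs


runs = planned_runs()
```

With the defaults this yields 10 runs per sample: 5 cluster settings for each of the 2 ploidy initializations.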

max_iters = 50
pseudo_counts = 1e-300
txn_exp_len = 1e9
txn_z_strength = 1e9
alpha_k = 2500
alpha_high = 20000
– Advanced TitanCNA settings. See Details for running TitanCNA R package page.

# sequencing platform allelic skew
titan_skew_illumina = 0
titan_skew_solid = 0.1
– Allelic skew settings for Illumina and SOLiD sequencing platforms. In SOLiD data, a skew towards the reference allele may be observed. TitanCNA accounts for this by shifting the allelic ratio baseline by 0.1 (towards the reference). Users specify the platform with the --platform option when launching TITANRunner (see Launching the TITANRunner pipeline).
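
As a small worked example of the shift, the sketch below assumes the allelic ratio is the fraction of reads supporting the reference allele, so that a balanced heterozygous SNP has a baseline of 0.5 before any skew is applied. This interpretation and the function name are assumptions for illustration only.

```python
# Skew values from the configuration file, keyed by --platform choice.
SKEW = {"illumina": 0.0, "solid": 0.1}


def heterozygous_baseline(platform):
    """Expected reference allelic ratio at a balanced heterozygous SNP.

    Assumes a baseline of 0.5 that is shifted towards the reference
    allele by the platform-specific skew.
    """
    return 0.5 + SKEW[platform]


illumina_baseline = heterozygous_baseline("illumina")
solid_baseline = heterozygous_baseline("solid")
```

Under these assumptions the baseline stays at 0.5 for Illumina and moves to about 0.6 for SOLiD.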

# thresholds to use when extracting read counts
mapping_quality = 20
base_quality = 10
– Quality thresholds used when extracting TitanCNA input read counts at heterozygous germline SNP positions in the tumour.
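
To illustrate the effect of the two thresholds, the sketch below counts alleles at a single SNP position while discarding low-quality observations. It is not the pipeline's own implementation: the input format, the function name, and the choice to treat the thresholds as inclusive minimums are all assumptions.

```python
from collections import Counter

MAPPING_QUALITY = 20  # mapping_quality from the configuration file
BASE_QUALITY = 10     # base_quality from the configuration file


def count_alleles(observations, min_mapq=MAPPING_QUALITY,
                  min_baseq=BASE_QUALITY):
    """Count bases at one SNP position, keeping observations that meet both thresholds.

    observations: iterable of (base, mapping_quality, base_quality) tuples.
    """
    counts = Counter()
    for base, mapq, baseq in observations:
        if mapq >= min_mapq and baseq >= min_baseq:
            counts[base] += 1
    return counts


# Hypothetical reads covering one heterozygous SNP (reference A, alternate G).
reads = [("A", 60, 30), ("G", 60, 35), ("G", 10, 30),
         ("A", 40, 5), ("G", 20, 10)]
counts = count_alleles(reads)
```

In this example two of the five observations are dropped, one for low mapping quality and one for low base quality. Raising either threshold trades read depth for cleaner allele counts.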