TITAN – Details for running TitanCNA R package

TITAN Home | Downloads | Installation | TITANRunner Pipeline | TitanCNA R package | Output | FAQ

For users who wish to learn more about using TitanCNA within R, here are further details of the functions provided by the package.

Demo TitanCNA

 ~$ [R dir]/bin/R # Following commands in R 
library(TitanCNA) 
example(TitanCNA)

Detailed example of TitanCNA

TITAN estimates the cellular prevalence (proportion of tumour cells with the tumour genotype) based on a fixed number of clonal clusters. Users are required to run TITAN on a tumour sample for a range of clonal clusters in parallel. We advise users to use a range of 1 to 5 but up to as many as 10 is also reasonable. When results for the 5 jobs are generated, one for each fixed number of clusters, each parameter output file (see step 12b) will contain an S_Dbw Validity Index score. The run for the tumour sample that has the minimum S_Dbw score is the optimal result and the corresponding number of clonal clusters is the optimal one. Here are the steps to run TITAN in R for the provided example. Note that this example is only for chr2 and fixed at 2 clonal clusters. For real data, users will run this same code five times, once for each possible number of clonal clusters, ranging from 1 to 5.

TitanCNA can be used to analyze both whole genome (WGS) and exome (WES) sequencing data. The following example demonstrates the analysis for WGS data.  For WES, there is only one step in the analysis that is different.  See Step 7 for details.

1. Load the TitanCNA library

Within an R session (R-3.0.2 or higher), load the installed TitanCNA library. This will automatically load “foreach” and “HMMcopy”.

 ~$ [R dir]/bin/R
library(TitanCNA)

2. Load the parameters

Load the default using a specified maximum copy number that TITAN will consider. Here, we use 5 but up to 8 copies can be used. Here, the number of clonal clusters (1, 2, 3, 4, or 5) to use for the TITAN run is specified.

numClusters <- 2
params <- loadDefaultParameters(copyNumber=5,numberClonalClusters=numClusters)

## NEW IN V1.5.5 ## Use the input data to inform the baseline allelic ratios when treating genotypes as symmetric (TRUE by default). Note that you will need to perform Step 6 first in order to load the data.

data <- loadAlleleCounts(infile, symmetric=TRUE)
params <- loadDefaultParameters(copyNumber=5, numberClonalClusters=numClusters, symmetric=TRUE, data=data)

3. Using custom parameters

Users can try different parameter settings for their analysis. See the FAQ page for examples of settings that can help with common issues.

a. Normal proportion initialization

params$normalParams$n_0 <- 0.5    #set initial normal proportion to 0.5

b. Tumour ploidy initialization

params$ploidyParams$phi_0 <- 2    #set initial ploidy to 2 (diploid)
params$ploidyParams$phi_0 <- 4    #set initial ploidy to 4 (tetraploid)

4. Running with parallelization (optional)

We recommend using the built in parallelization within R. There are several R packages that allow for this. The popular ones are “doMC”, “doMPI”, “doSNOW”. Here’s an example of using “doMC” (after installation) with 8 cores specified.

 library(doMC)
registerDoMC()
options(cores=4)

5. Input files

TitanCNA requires 2 main input files: a) tumour allele counts and b) tumour and normal wig files

a. Generating the tumour allele counts file

This file contains the tumour read counts for germline heterozygous SNP positions. See TITANRunner workflow description for the steps. The steps to generate this file are the following:

i. The positions of interest are identified as heterozygous SNPs from the matched normal sample.

If users wish to perform this step manually (instead of using TITANRunner), one solution is to run SAMtools/bcftools (v1.1) and SNPSift filter outside of R.

samtools mpileup -uv -I -f ref.fasta norm.bam | bcftools call -v -c - | java -jar SnpSift.jar filter "isHet( GEN[0] )" > normHetSNPs.vcf
Because BAM files can be large, this step may take a long time. Users should consider parallelizing this task by chromosomes.
samtools mpileup -uv -I -f ref.fasta -r $chrToUse -l dbSNP.positions.txt norm.bam | bcftools call -v -c - | java -jar SnpSift.jar filter "isHet( GEN[0] )" > normHetSNPs.vcf

Alternatively, if users do not want to use SNPSift, then grep commands can be applied instead.
samtools mpileup -uv -I -f ref.fasta -r $chrToUse -l dbSNP.positions.txt norm.bam | bcftools call -v -c - | grep -e "#" -e "0/1" > normHetSNPs.vcf

norm.bam is the path of the matched normal BAM file
ref.fasta is the path to the reference genome
normHetSNPs.vcf is the path to the BCF output
$chrToUse is the chromosome (e.g. 1 or chr1, depending on your reference genome)
dbSNP.positions.txt is a 2-column (chr, position), tab-delimited file containing positions of interest (optional).

ii. For each of these positions, extract the tumour read counts using the R package

Rsamtools

This step can be accomplished in R using the Bioconductor package, Rsamtools (version >= 1.17.11). We provide an R script that users can run without any modifications. Users need only specify the tumour BAM file path, tumour BAM index file path, normHetSNPs.vcf (generated in Step 5a, part i), and the output file name.

Click here to download the R script.

This script has a runtime of 15-20 minutes for a single chromosome (Rsamtools pileup() command) and has a negligible memory footprint.

## NEW IN V1.2.0 ## The steps in this script have been included in TitanCNA v1.2.0 as a function, extractAlleleReadCounts

extractAlleleReadCounts(bamFile, bamIndex, positions, outputFilename = NULL, pileupParam = PileupParam())

The format of the tumour allele counts file consists of 6 columns:

  1. Chromosome
  2. Position – heterozygous germline SNP positions (identified from the matched normal)
  3. Reference base – {A,T,C,G}; this can be arbitrary
  4. Reference count – number of aligned reads that match the reference base
  5. Non-reference base – {A,T,C,G}; this can be arbitrary
  6. Non-reference count – number of reads aligned that match the non-reference base
head test_alleleCounts_chr2.txt
chr	position	ref	refCount	Nref	NrefCount
2	13256	C	7	X	2
2	36899	T	3	X	6
2	55802	T	13	X	12
2	56232	A	3	X	7

b. Tumour and normal WIG files

The tumour/normal wig files consists of read counts for equal-sized bins (e.g. 1kb). This file can be generated using the HMMcopy suite. See Step 7 for more details.

c. Example: chromosome 2 of a breast cancer sample

For this example, we have provided the input files generated using the Python ruffus TITANRunner pipeline; only chromosome 2 is given. For real data with 22+ chromosomes, nothing extra needs to be done (until plotting). Within R, load the example using the following commands:

id <- "test"
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA")
tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA")
normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA")
gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA")
map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA")

6. Load the tumour allele read counts

Load the main input file by providing the filename, infile. This file is generated using the Python ruffus TITANRunner pipeline.

data <- loadAlleleCounts(infile)

Users can specify the format of the chromosomes as well. For example, UCSC uses “chr1” while NCBI uses “1”. Default is NCBI.

data <- loadAlleleCounts(infile, genomeStyle = "UCSC")

7. Correct GC content and mappability biases

Correct GC content and mappability biases by using a wrapper function that calls HMMcopy specific routines. In Step 5, we provided the files for chr2, but you will need to generate some of these files for your sample. The wrapper will apply the correction sub-routines and compute the log ratio, log2(tumour:normal). Four wig files are required. First, the data wig files for 1) tumour and 2) normal are generated using the HMMcopy suite or by the Python ruffus TITANRunner pipeline. Next, the GC content and mappability files specifically for the NCBI build 37 (hg19) (GRCh37-lite.fasta) reference are provided with the TITANRunner suite (see Downloads). Users may need to generate files for the reference genome you are using in the analysis (see the HMMcopy page).

  1. GC content – GRCh37-lite.gc.ws_1000.wig
  2. Mappability scores – GRCh37-lite.map.ws_1000.wig

In R, use the following code,

cnData <- correctReadDepth(tumWig,normWig,gc,map)

Users can specify the format of the chromosomes as well. For example, UCSC uses “chr1” while NCBI uses “1”. Default is NCBI.

cnData <- correctReadDepth(tumWig,normWig,gc,map, genomeStyle = "UCSC")
Whole exome sequencing data

For whole exome sequencing data, there will be regions of the genome with no coverage. To account for this, users can indicate the region boundaries (e.g. exons) for which read coverage is expected.

cnData <- correctReadDepth(tumWig,normWig,gc,map, targetedSequence = exons)

exons is a data frame containing the boundaries. The first 3 columns must be: chr, start, stop.

8. Assign copy number to each position

Find the log ratio at each position of interest (germline heterozygous SNPs) given in the object data. Then, transform the log ratios to natural logs instead of log base 2 (we will transform these back later). Remove logR and cnData objects to save memory.

logR <- getPositionOverlap(data$chr,data$posn,cnData)
data$logR <- log(2^logR)
rm(logR,cnData)

9. Filter the data

Filter out low and high depth positions and positions with NA’s.

data <- filterData(data,c(1:22,"X"),minDepth=10,maxDepth=200)

Users can also choose to remove positions that are highly mappable.

mScore <- as.data.frame(wigToRangedData(map))
mScore <- getPositionOverlap(data$chr,data$posn,mScore[,-4])
data <- filterData(data,c(1:22,"X"),minDepth=10,maxDepth=200,map=mScore,mapThres=0.8)

10. Parameter estimation

TITAN uses the Expectation Maximization algorithm (EM) and forwards-backwards algorithm. This step requires parallelization (by chromosome) to increase run-time performance.

convergeParams <- runEMclonalCN(data,gParams=params$genotypeParams,nParams=params$normalParams,
                                pParams=params$ploidyParams,sParams=params$cellPrevParams,
                                maxiter=20,maxiterUpdate=1500,txnExpLen=1e15,txnZstrength=1e5,
                                useOutlierState=FALSE,
                                normalEstimateMethod="map",estimateS=TRUE,estimatePloidy=TRUE)

Here is a little more detail regarding some important arguments:

  • maxiter = maximum number of EM iterations allowed. In practice, it does not exceed 20.
  • maxiterUpdate = maximum number of coordinate descent iterations during the M-step when parameters are estimated
  • txnExpLen = influences prior probability of genotype transitions in the HMM. The larger the value, the lower tendency to change state; however, too small and it produces underflow problems.  ## in version v1.2.0, underflow will not occur ##
  • txnZstrength = influences prior probability of clonal cluster transitions in the HMM. Larger values means lower tendency to change clonal cluster state.
  • useOutlierState = logical indicating whether an additional outlier state should be used. In practice, this usually is not necessary.
  • normalEstimateMethod = specifies how to handle normal proportion estimation. To estimate, use “map” which is maximum a posteriori. If you wish to not estimate this parameter, then use “fixed”. This will default the normal proportion to whatever is specified in “params$normalParams$n_0″. For example, if you know that this sample has absolutely no normal contamination, then use params$normalParams$n_0 <- 0 and set normalEstimateMethod=”fixed”.
  • estimateS = logical indicating whether to account for clonality and estimate subclonal events
  • estimatePloidy = logical indicating whether to estimate and account for tumour ploidy

11. Optimal genotype/clonal cluster inference

The Viterbi algorithm is used to determine the optimal genotype/clonal cluster state path. After running EM, use the converge parameters and the input data to compute the optimal segmentation results.
> optimalPath <- viterbiClonalCN(data,convergeParams)

If users would like to control (to an extent) the number of segments, then the transition probabilities can be altered.
> convergeParams$txnExpLen <- 1e12
> #default 1e9 (see Step 10); can increase for even stronger self-transitions
> optimalPath <- viterbiClonalCN(data,convergeParams)

In general, this step is less intensive than the Parameter Estimation (Step 10). Therefore, if users were using parallelization in Step 10, then switching to one CPU is recommended for this step.

## for example, if using doMC
options(cores=4)

12. Output formatted results

Output the results to flat files. Two files are generated:

  • a. Position-specific results that includes genotype, clonal cluster, and cellular prevalence estimates. The posterior probabilities at each position can optionally be returned.
outfile <- paste("test_cluster0",numClusters,"_titan.txt",sep="")
results <- outputTitanResults(data,convergeParams,optimalPath,filename=outfile,posteriorProbs=F)

## NEW IN V1.2.0 ##

For numClusters <= 2, users can append predicted subclone profiles to the output.

results <- outputTitanResults(data,convergeParams,optimalPath,
                                     filename=outfile,posteriorProbs=FALSE,subcloneProfiles=TRUE)
  • b. Model parameters and summary values such as the number of clonal clusters and their corresponding cellular prevalence, normal contamination, ploidy, and the S_Dbw validity index for model selection later.
outparam <- paste("test_cluster0",numClusters,"_params.txt",sep="")
outputModelParameters(convergeParams,results,outparam)

13. Plot the results

Plot the results using 3 built in functions. For each chromosome, plot the log ratios (CNA), allelic ratios (LOH), and cellular prevalence. Again, this example is for chromosome 2. For real data, users will want to use a loop to plot all chromosomes. Using the plotting functions in this manner (3 subplots) requires plotting one chromosome at a time. Alternatively, if the second argument is set to NULL (ie. chr=NULL), then a genome wide plot is generated. This only works when chr1-22,chrX are in the analysis.

norm <- convergeParams$n[length(convergeParams$n)]
ploidy <- convergeParams$phi[length(convergeParams$phi)]
#library(SNPchip)  ## use this library to plot chromosome idiogram (optional)
png(outplot,width=1200,height=1000,res=100)
par(mfrow=c(3,1))
plotCNlogRByChr(results, chr, ploidy=ploidy, geneAnnot=NULL, spacing=4,ylim=c(-4,6),cex=0.5,main="Chr 2")
plotAllelicRatio(results, chr, geneAnnot=NULL, spacing=4, ylim=c(0,1),cex=0.5,main="Chr 2")
plotClonalFrequency(results, chr, normal=tail(convergeParams$n,1), geneAnnot=NULL, spacing=4,ylim=c(0,1),cex=0.5,main="Chr 2")
if (as.numeric(numClusters) <= 2){ 
       ## NEW IN V1.2.0 ##
       ## users can choose to plot the subclone copy number profiles for <= 2 clusters
	plotSubcloneProfiles(results, chr, cex = 2, spacing=6, main="Chr 2")
}
#pI <- plotIdiogram(chr,build="hg19",unit="bp",label.y=-4.25,new=FALSE,ylim=c(-2,-1))
dev.off()

14. Generate the segment file

Generate the segment and IGV compatible .seg files using a Perl script

titan-runner/scripts/createTITANsegmentfiles.pl -id=test_cluster02 -infile=test_cluster02_titan.txt -outfile=test_cluster02_segs.txt -outIGV=test_cluster02.seg