HMMcopy


Bias-free copy number estimation and robust detection of copy number alterations (CNAs) in tumour samples from whole genome sequencing (WGS) HTS data. The R component can now be downloaded from Bioconductor.

Description

HMMcopy is a suite of tools for estimating copy number in whole genome data, with GC and mappability correction, followed by segmentation and classification of copy number profiles using a robust Hidden Markov Model. Designed for high coverage whole genome HTS data, it can derive copy number estimates from large (~250GB) aligned BAM files in 2 hours on a single core with minimal memory requirements.

There are two sets of tools in the HMMcopy suite:

  1. C++-based programs for analyzing BAM files and preparing read counts
  2. An R package for correcting read counts and performing segmentation, installable via Bioconductor

Methodology and citing HMMcopy
We are currently preparing the manuscript for HMMcopy. In the meantime, please refer to the supplemental material in the APOLLOH paper as it contains the most detailed explanation available at this time.

G. Ha, A. Roth, D. Lai, A. Bashashati, J. Ding, R. Goya, R. Giuliany, J. Rosner, A. Oloumi, K. Shumansky, S.-F. Chin, G. Turashvili, M. Hirst, C. Caldas, M. A. Marra, S. Aparicio, S. P. Shah. Integrative analysis of genome-wide loss of heterozygosity and mono-allelic expression at nucleotide resolution reveals disrupted pathways in triple negative breast cancer. Genome Research 2012 Oct;22(10):1995-2007.

If you have questions regarding this software, please contact
Gavin Ha (gha [at] bccrc [dot] ca)
or
Daniel Lai (jujubix [at] cs [dot] ubc [dot] ca)

Downloads

Request the software download here.

  • Added functionality: Filter duplicate reads in BAM files that have been dupsFlagged
  • Added files: hg19 mappability and hg19 gc content

HMMcopy-vignette
HMMcopy-manual

Installation

The package was designed to minimize dependencies and maximize portability; it should compile and run on all platforms provided the following:

    1. Requirements: a standard C++ toolchain and CMake

    2. Build binary

tar xzvf HMMcopy.tar.gz
cd HMMcopy
cmake .
make

    3. Install the HMMcopy R package using Bioconductor

R 2.11.0 or a higher version is required.  Refer to the HMMcopy Bioconductor homepage for more information.

source("http://bioconductor.org/biocLite.R")
biocLite("HMMcopy")
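
On newer R installations (R 3.5+ / Bioconductor 3.8+), biocLite has been retired in favour of BiocManager; if the commands above fail, the following sketch assumes a current Bioconductor release:

# For R >= 3.5, where biocLite is no longer available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("HMMcopy")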

Demo

R
> # Following commands in R
> library(HMMcopy)
> example("HMMcopy-package")

Usage

All programs and scripts print help guides when called with -h, --help, or no arguments. The genome of interest is split into non-overlapping windows of fixed width (i.e. bins), and statistics are gathered for each bin. To conserve space and memory, there is a heavy emphasis on the usage and recycling of pre-computed files, and on the ability to run on a single core with minimal memory requirements.

    1. Inputs: aligned tumour and normal BAM files, the matching FASTA reference, and (optionally) a BigWig mappability file

    2. Obtain readcounts for bins (compute for each BAM file)

bin/readCounter <TUMOUR BAM file> > tum_reads.wig
bin/readCounter <NORMAL BAM file> > norm_reads.wig

readCounter can handle a 250GB BAM file on a single core in 2 hours, and bin size should NOT affect runtime. A samtools BAM index is required; an instructive error will print if one is not found. You may build the index with option -b, which may take a couple of minutes.
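
For orientation, each output .wig file lists one readcount per bin under a fixedStep-style track header, one bin per line; the chromosome name and bin size below are illustrative only and depend on your reference and settings:

fixedStep chrom=1 start=1 step=1000 span=1000
52
71
45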

    3. Obtain GC content for bins (precomputed once for the same reference and bin size)

bin/gcCounter <FASTA reference> > gc.wig

Should take about a minute for the human genome at default settings. We have generated this file using the NCBI36 (release 54; hg18) reference; you can find it at data/gc.wig.

Ensure that the resulting number of bins matches that of the readcounts. This typically means generating bins from the same reference genome with the same chromosomes. Option -c can be used to specify chromosomes, and option -l lists the names and lengths of all chromosomes in the FASTA reference. Counting the number of lines (e.g. with unix ‘wc’) is a quick way to check that the number of bins is the same across .wig files, as each bin occupies a single line.
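
For example, a quick check in R (filenames as in the steps above) that every .wig file has the same number of lines:

R
> # Each file should report the same line count if the bins match
> sapply(c("tum_reads.wig", "norm_reads.wig", "gc.wig", "map.wig"), function(f) length(readLines(f)))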

  4. Obtain average mappability for bins (precomputed once for the same reference and bin size)

      • If you DON’T have a BigWig mappability file

        See below: Getting a Mappability File

      • If you DO have a BigWig mappability file
    bin/mapCounter <BigWig file> > map.wig
    
    

    Should take about 2 minutes for the human genome at default settings.

    Like the GC content bins, ensure that the number of bins matches that of the readcounts, again using options -c and -l (see the step above) to specify chromosomes in the BigWig file if required.

  5. Correct readcount and estimate copy number

    R
    > # Below commands in R
    > library(HMMcopy)
    > tum_uncorrected_reads <- wigsToRangedData("tum_reads.wig", "gc.wig", "map.wig")
    > norm_uncorrected_reads <- wigsToRangedData("norm_reads.wig", "gc.wig", "map.wig")
    > tum_corrected_copy <- correctReadcount(tum_uncorrected_reads)
    > norm_corrected_copy <- correctReadcount(norm_uncorrected_reads)
    
    

    Should take no longer than a few minutes on a human genome.

    The correctReadcount function requires at least about 1000 bins to work properly.
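
    To sanity check the correction before segmenting, it may help to inspect the result in R; the copy column (the corrected readcount consumed by the segmentation step below) should now be populated:

    > # Inspect the corrected bins and the distribution of corrected values
    > head(tum_corrected_copy)
    > summary(tum_corrected_copy$copy)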

  6. Segmentation

    > # Below commands in R
    > # Normalizing Tumour by Normal
    > tum_corrected_copy$copy <- tum_corrected_copy$copy - norm_corrected_copy$copy
    > # Segmenting
    > param <- HMMsegment(tum_corrected_copy, getparam = TRUE) # retrieve converged parameters via EM
    > # Pin the 6 state medians to log2 ratios of expected copies relative to diploid (2)
    > param$mu <- log(c(1, 1.4, 2, 2.7, 3, 4.5) / 2, 2)
    > param$m <- param$mu
    > segmented_copy <- HMMsegment(tum_corrected_copy, param) # perform segmentation via Viterbi
    
    

    See package vignette for more details on parameter setting:

    > # Below command in R
    > vignette("HMMcopy")
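
    To review the result, the segmentation output can be inspected directly. A minimal sketch, assuming (as in current HMMcopy releases) that HMMsegment returns a list whose segs element holds one row per segment:

    > # One row per segment (list layout assumed as described above)
    > head(segmented_copy$segs)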
    
    
  7. Export

    > # Following commands in R
    > # Export to SEG format for CNAseq segmentation
    > rangedDataToSeg(tum_corrected_copy, file = "tum_corrected_copy.seg")
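
    The exported file is plain text and can be checked quickly from R:

    > # Peek at the first lines of the exported SEG file
    > readLines("tum_corrected_copy.seg", n = 5)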
    
    
  8. Visualization

    > # Following commands in R, using objects from steps 5 and 6
    > plotBias(tum_corrected_copy)
    > plotCorrection(tum_corrected_copy)
    > plotSegments(tum_corrected_copy, segmented_copy)
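
    Assuming the plotting functions draw to the active graphics device (as base R graphics do), plots can be written to file with an ordinary device, e.g.:

    > # Save the segment plot to a PDF via base R's pdf device
    > pdf("tum_segments.pdf", width = 10, height = 4)
    > plotSegments(tum_corrected_copy, segmented_copy)
    > dev.off()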
    
    

Getting a Mappability File

If the genome of interest is relatively small (i.e. <= 100MB), building a mappability file from scratch (Option B) is probably the recommended method. If you’re working with the human genome (>= 3GB), chances are there are pre-computed files readily available (Option A). If you insist on (or are forced into) building a mappability file for a human-sized genome, a rather powerful computer is required (>= 20GB RAM, 1 core).

A) Get one from somewhere else,

but ensure the FASTA reference is identical, e.g.: Homo_sapiens.NCBI36.54.dna.chromosomes.fa.map.bw
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeMapability/
BigWig files have the extension .bw, but WIG files (extension .wig) are also acceptable.
In the case of WIG files, convert to BigWig format with:

util/bigwig/wigToBigWig input.wig chrom.sizes output.bw

wigToBigWig requires a file of chromosome sizes (chrom.sizes above), obtainable via:

bin/gcCounter -l <FASTA reference> > chrom.sizes

B) OR: Generate your own from scratch

Requires: perl (http://www.perl.org/), bowtie (http://bowtie-bio.sourceforge.net/index.shtml)

util/mappability/generateMap.pl <FASTA reference> > mappability.bw

Takes about 40 minutes for a 100MB reference on a single core, and about 20 hours for the human genome (it seems to scale linearly). WARNING: it takes about the same amount of memory, so that’s 20GB of RAM required for the human genome.
It also requires a bowtie index for your FASTA file, which you may build with default settings through the -b flag, build manually with bowtie-build, or obtain as a pre-built index if available (http://bowtie-bio.sourceforge.net/index.shtml).